Abstract

The relationship between the Bayesian approach and the minimum description length approach
is established. We sharpen and clarify the general modeling principles MDL and MML,
abstracted as the ideal MDL principle and defined from Bayes's rule by means of Kolmogorov
complexity. The basic condition under which the ideal principle should be applied is
encapsulated as the Fundamental Inequality, which in broad terms states that the principle is valid
when the data are random relative to every contemplated hypothesis and these hypotheses
are also random relative to the (universal) prior. Basically, the ideal principle states that the prior
probability associated with the hypothesis should be given by the algorithmic universal
probability, and that the sum of the negative log universal probability of the model plus the negative
log of the probability of the data given the model should be minimized. If we restrict the model
class to the finite sets then application of the ideal principle turns into Kolmogorov's minimal
sufficient statistic. In general we show that data compression is almost always the best strategy,
both in hypothesis identification and prediction.
1 Introduction



It is widely believed that the better a theory compresses the data concerning some phenomenon
under investigation, the better we have learned and generalized, and the better the theory predicts
unknown data. This belief is vindicated in practice and is a form of the "Occam's razor" paradigm
about "simplicity," but apparently has not been rigorously proved in a general setting. Here we
show that data compression is almost always the best strategy, both in hypothesis identification
by using an ideal form of the minimum description length (MDL) principle and in prediction of
sequences. To demonstrate these beneficial aspects of compression we use the Kolmogorov theory
of complexity [12] to express the optimal effective compression. We identify precisely the situations
in which MDL and Bayesianism coincide and where they differ.

Hypothesis identification. To demonstrate that compression is good for hypothesis identification
we use the ideal MDL principle defined from Bayes's rule by means of Kolmogorov complexity,
Section 2. This transformation is valid only for individually random objects in computable
distributions; if the contemplated objects are nonrandom or the distributions are not computable
then MDL and Bayes's rule may part company. Basing MDL on first principles we probe below
the customary presentation of MDL as being justified in and of itself by philosophical persuasion
[21, 22]. The minimum message length (MML) approach, while relying on priors, is in practice a
related approach [31, 32]. Such approaches balance the complexity of the model (and its tendency
for overfitting) against the preciseness of fitting the data (the error of the hypothesis). Our analysis
gives evidence why in practice Bayesianism is prone to overfitting and MDL is not.

Ideal MDL. We are only interested in the following common idea shared between all MDL-like
methods: "Select the hypothesis which minimizes the sum of the length of the description of the
hypothesis (also called "model") and the length of the description of the data relative to the
hypothesis." We take this to mean that every contemplated individual hypothesis and every contemplated
individual data sample is to be maximally compressed: the description lengths involved should be
the shortest effective description lengths. We use "effective" in the sense of "Turing computable"
[28]. Shortest effective description length is asymptotically unique and objective and known as the
Kolmogorov complexity [12] of the object being described. Thus, "ideal MDL" is a Kolmogorov
complexity based form of the minimum description length principle. In order to define ideal MDL
from Bayes's rule we require some deep results due to L.A. Levin [16] and P. Gács [?] based on
the novel notion of individual randomness of objects as expressed by P. Martin-Löf's randomness
tests. We show that the principle is valid when a basic condition encapsulated as the "Fundamental
Inequality" (10) in Section 2 is satisfied. Broadly speaking this happens when the data are
random relative to each contemplated hypothesis, and these hypotheses are also random relative
to the contemplated prior. The latter requirement is always satisfied for the so-called "universal"
prior. Under those conditions ideal MDL, Bayesianism, MDL, and MML select pretty much the
same hypothesis. Theorem ?? states that minimum description length reasoning using shortest
effective descriptions coincides with Bayesian reasoning using the universal prior distribution [16, ?],
provided the minimum description length is achieved for those hypotheses with respect to which
the data sample is individually random (in the sense of Martin-Löf). If we restrict the model class
to finite sets then this procedure specializes to Kolmogorov's minimal sufficient statistic [17].

Kolmogorov complexity. We recapitulate the basic definitions in Appendix A in order to
establish notation. Shortest effective descriptions are "effective" in the sense that we can compute
the described objects from them. Unfortunately [34], there is no general method to compute
the length of a shortest description (the Kolmogorov complexity) from the object being described.
This obviously impedes actual use. Instead, one needs to consider recursive approximations to
shortest descriptions, for example by restricting the allowable approximation time. This course
is followed in one sense or another in the practical incarnations such as MML and MDL. There
one often uses simply the Shannon-Fano code, which assigns prefix code length $l_x := -\log P(x)$ to
$x$ irrespective of the regularities in $x$. If $P(x) = 2^{-n}$ for every $x \in \{0,1\}^n$, then the code word
length of an all-zero $x$ equals the code word length of a truly irregular $x$. While the Shannon-Fano
code gives an expected code word length close to the entropy, it does not distinguish the regular
elements of a probability ensemble from the random ones.
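The contrast drawn above can be illustrated with a minimal sketch. It assumes Python and uses zlib purely as a stand-in for "some effective compressor"; the helper names are hypothetical, and a zlib length is only a crude upper-bound proxy, not the Kolmogorov complexity itself.

```python
import random
import zlib


def shannon_fano_length_uniform(n_bits: int) -> int:
    """Code length -log2 P(x) in bits when P(x) = 2**-n for every n-bit x."""
    return n_bits  # -log2(2**-n) = n, whatever x looks like


def compressed_length_bits(data: bytes) -> int:
    """Length in bits of a zlib encoding (a computable proxy, not K(x))."""
    return 8 * len(zlib.compress(data, 9))


n_bytes = 1000
regular = bytes(n_bytes)                 # the all-zero string
rng = random.Random(0)                   # fixed seed for reproducibility
irregular = bytes(rng.getrandbits(8) for _ in range(n_bytes))

# The uniform Shannon-Fano code charges both strings the same 8000 bits.
print(shannon_fano_length_uniform(8 * n_bytes))
# A compressor separates the regular string from the irregular one.
print(compressed_length_bits(regular), compressed_length_bits(irregular))
```

The point of the design is only that the first number ignores the structure of the string while the second pair does not.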

Universal probability distribution. Just as the Kolmogorov complexity measures the shortest
effective description length of an object, the universal probability measures the greatest effective
probability. Both notions are objective and absolute in the sense of being recursively invariant by
Church's thesis [17]. We give definitions in Appendix ??. We use universal probability as a universal
prior in Bayes's rule to analyze ideal MDL.

Martin-Löf randomness. The common meaning of a "random object" is an outcome of a
random source. Such outcomes have expected properties but particular outcomes may or may
not possess these expected properties. In contrast, we use the notion of randomness of individual
objects. This elusive notion's long history goes back to the initial attempts by von Mises [29]
to formulate the principles of application of the calculus of probabilities to real-world phenomena.
Classical probability theory cannot even express the notion of "randomness of individual objects."
Following almost half a century of unsuccessful attempts, the theory of Kolmogorov complexity
[12] and Martin-Löf tests for randomness [?] finally succeeded in formally expressing the novel
notion of individual randomness in a correct manner, see [?]. Every individually random object
possesses individually all effectively testable properties that are only expected for outcomes of the
random source concerned. It will satisfy all effective tests for randomness, known and unknown
alike. In Appendix ?? we recapitulate the basics.

Two-part codes. The prefix code of the shortest effective descriptions gives an expected code
word length close to the entropy and also compresses the regular objects until all regularity is
squeezed out. All shortest effective descriptions are completely random themselves, without any
regularity whatsoever. The MDL idea of a two-part code for a body of data $D$ is natural from
the perspective of Kolmogorov complexity. If $D$ does not contain any regularities at all, then it
consists of purely random data and the hypothesis is precisely that. Assume that the body of data
$D$ contains regularities. With help of a description of those regularities (a model) we can describe
the data compactly. Assuming that the regularities can be represented in an effective manner
(that is, by a Turing machine), we encode the data as a program for that machine. Squeezing all
effective regularity out of the data, we end up with a Turing machine representing the meaningful
regular information in the data together with a program for that Turing machine representing the
remaining meaningless randomness of the data. This intuition finds its basis in Definitions 8 and
?? in Appendix ??. However, in general there are many ways to make the division into meaningful
information and remaining random information. In a painting the represented image, the brush
strokes, or even finer detail can be the relevant information, depending on what we are interested in.
What we require is a rigorous mathematical condition to force a sensible division of the information
at hand into a meaningful part and a meaningless part. One way to do this in a restricted setting
where the hypotheses are finite sets was suggested by Kolmogorov at a Tallinn conference in 1973
and published in [13]. See [17] and Section 2.1. Given data $D$, the goal is to identify the "most
likely" finite set $A$ of which $D$ is a "typical" element. For this purpose we consider sets $A$ such that
$D \in A$ and we represent $A$ by the shortest program $A^*$ that computes the characteristic function
of $A$. The Kolmogorov minimal sufficient statistic is the shortest $A^*$, say $A_0^*$ associated with the
set $A_0$, over all $A$ containing $D$ such that the two-part description consisting of $A_0^*$ and $\log d(A_0)$
(where $d(A)$ denotes the number of elements of $A$) is as short as the shortest single program that
computes $D$ without input. This definition is nonvacuous since there is a two-part code (based on
hypothesis $A_D = \{D\}$) that is as concise as the shortest single code.

The shortest two-part code must be at least as long as the shortest one-part code. Therefore,
the description of $D$ given $A_0^*$ cannot be significantly shorter than $\log d(A_0)$. By the theory of
Martin-Löf randomness in Appendix ?? this means that $D$ is a "typical" element of $A_0$. The ideal
MDL principle expounded in this paper is essentially a generalization of the Kolmogorov minimal
sufficient statistic.

Note that in general finding a minimal sufficient statistic is not recursive. Similarly, even
computing the MDL optimum in a much more restricted class of models may run into computational
difficulties since it involves finding an optimum in a large set of candidates. In some cases one can
approximate this optimum [?, ?].

Prediction. The best single hypothesis does not necessarily give the best prediction. For example,
consider a situation where we are given a coin of unknown bias $p$ of coming up "heads" which
is either $p_1 = \frac{1}{3}$ or $p_2 = \frac{2}{3}$. Suppose we have determined that there is probability $\frac{2}{3}$ that
$p = p_1$ and probability $\frac{1}{3}$ that $p = p_2$. Then the "best" hypothesis is the most likely one: $p = p_1$,
which predicts a next outcome "heads" as having probability $\frac{1}{3}$. Yet the best prediction is that
this probability is the expectation of throwing "heads," which is

\[ \frac{2}{3} p_1 + \frac{1}{3} p_2 = \frac{4}{9}. \]

Thus, the fact that compression is good for hypothesis identification problems does not imply
that compression is good for prediction. In Section ?? we analyze the relation between compression
of the data sample and prediction in the very general setting of R. Solomonoff [25, ?]. We explain
Solomonoff's prediction method using the universal distribution. We show that this method is not
equivalent to the use of shortest descriptions. Nonetheless, we demonstrate that compression of
descriptions almost always gives optimal prediction.
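The coin example can be checked with exact rational arithmetic. The sketch below assumes the reading $p_1 = 1/3$, $p_2 = 2/3$ with posterior weights $2/3$ and $1/3$, which is the assignment that reproduces the stated expectation $4/9$; the function names are hypothetical.

```python
from fractions import Fraction

# Posterior weight of each candidate bias (assumed values, see the lead-in).
posterior = {
    Fraction(1, 3): Fraction(2, 3),   # weight of p = p1
    Fraction(2, 3): Fraction(1, 3),   # weight of p = p2
}


def map_prediction(post):
    """Probability of heads according to the single most likely hypothesis."""
    return max(post, key=post.get)


def mixture_prediction(post):
    """Probability of heads averaged over all hypotheses."""
    return sum(bias * weight for bias, weight in post.items())


print(map_prediction(posterior))      # prediction of the best single hypothesis
print(mixture_prediction(posterior))  # expectation over both hypotheses
```

The two printed values differ, which is the sense in which the best single hypothesis need not give the best prediction.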



Scientific inference. The philosopher D. Hume (1711-1776) argued [11] that true induction is
impossible because we can only reach conclusions by using known data and methods. Therefore,
the conclusion is logically already contained in the start configuration. Consequently, the only form
of induction possible is deduction. Philosophers have tried to find a way out of this deterministic
conundrum by appealing to probabilistic reasoning such as using Bayes's rule [?]. One problem
with this is where the "prior probability" one uses has to come from. Unsatisfactory solutions have
been proposed by philosophers like R. Carnap [?] and K. Popper [20].

Essentially, combining the ideas of Epicurus, Ockham, Bayes, and modern computability theory,
Solomonoff [25] has successfully invented a "perfect" theory of induction. It incorporates
Epicurus's multiple explanations idea, since no hypothesis that is still consistent with the data
will be eliminated. It incorporates Ockham's simplest explanation idea since the hypotheses with
low Kolmogorov complexity are more probable. The inductive reasoning is performed by means of
the mathematically sound rule of Bayes.

Comparison with Related Work. Kolmogorov's minimal sufficient statistic deals with hypothesis
selection where the considered hypotheses are finite sets of bounded cardinality. Ideal
MDL hypothesis selection generalizes this procedure to arbitrary settings. It is satisfying that our
findings on ideal MDL confirm the validity of the "real" MDL principle which rests on the idea
of stochastic complexity. The latter is defined in such a way that it represents the shortest code
length only for almost all data samples (stochastically speaking the "typical" ones) for all models
with real parameters in certain classes of probabilistic models except for a set of Lebesgue measure
zero [?, 19]. Similar results concerning probability density estimation by MDL are given
in [?]. These references consider probabilistic models and conditions. We believe that in many
current situations the models are inherently non-probabilistic as, for example, in the transmission
of compressed images over noisy channels [27]. Our algorithmic analysis of ideal MDL is about
such non-probabilistic model settings as well as probabilistic ones (provided they are recursive).
The results are derived in a nonprobabilistic manner entirely different from the cited papers. It
is remarkable that there is a close agreement between the real properly articulated MDL principle
and our ideal one. The ideal MDL principle is valid in case the data are individually random with
respect to the contemplated hypothesis and the latter is an individually random element of the
contemplated prior. Individually random objects are in a rigorous formal sense "typical" objects
in a probability ensemble and together they constitute almost all such objects (all objects except
for a set of Lebesgue measure zero in the continuous case). The nonprobabilistic expression of the
range of validity of "ideal MDL" implies the probabilistic expressions of the range of validity of
the "real MDL" principle.

Our results are more precise than the earlier probabilistic ones in that they explicitly identify 
the "excepted set of Lebesgue measure zero" for which the principle may not be valid as the set 
of "individually nonrandom elements." The principle selects models such that the presented data 
are individually random with respect to these models: if there is a true model and the data are 
not random with respect to it then the principle avoids this model. This leads to a mathematical 
explanation of correspondences and differences between ideal MDL and Bayesian reasoning, and in 
particular it gives some evidence under what conditions the latter is prone to overfitting while the 
former isn't. 



2 Ideal MDL 

The idea of predicting sequences using shortest effective descriptions was first formulated by R.
Solomonoff [?]. He uses Bayes's formula equipped with a fixed "universal" prior distribution.
In accordance with Occam's dictum, that distribution gives most weight to the explanation that
compresses the data the most. This approach inspired Rissanen [21, ?] to formulate the MDL
principle. Unaware of Solomonoff's work, Wallace and his co-authors [31, 32] formulated a related
but somewhat different Minimum Message Length (MML) principle.

We focus only on the following central ideal version which we believe is the essence of the matter. 
Indeed, we do not even care about whether we deal with statistical or deterministic hypotheses. 

Definition 1. Given a sample of data, and an effective enumeration of models, ideal MDL selects
the model with the shortest effective description that minimizes the sum of

• the length, in bits, of an effective description of the model; and

• the length, in bits, of an effective description of the data when encoded given the model.

Under certain conditions on what constitutes a "model" and the notion of "encoding given the
model" this coincides with Kolmogorov's minimal sufficient statistic in Section 2.1. In the latter
a "model" is constrained to be a program that enumerates a finite set of data candidates that
includes the data at hand. Additionally, the "effective description of the data when encoded given
the model" is replaced by the logarithm of the number of elements in the set that constitutes
the model. In ideal MDL we deal with model classes that may not be finite sets, like the set of
context-free languages or the set of recursive probability density functions. Therefore we require a
more general approach.

Intuitively, a more complex hypothesis H may fit the data better and therefore decreases the 
misclassified data. If H describes all the data, then it does not allow for measuring errors. A 
simpler description of H may be penalized by increasing the number of misclassified data. If H 
is a trivial hypothesis that contains nothing, then all data are described literally and there is no 
generalization. The rationale of the method is that a balance in between seems to be required. 
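The balance described above can be made concrete with a practical (not ideal) two-part code. The following sketch assumes a Bernoulli model whose parameter is restricted to a grid of $2^d$ points, so that naming the parameter costs $d$ bits; this crude model code and all function names are hypothetical choices for illustration, and Kolmogorov complexity is not computed anywhere.

```python
import math


def data_cost_bits(k: int, n: int, p: float) -> float:
    """-log2 Pr(D | p) for one binary string with k ones among n symbols."""
    return -(k * math.log2(p) + (n - k) * math.log2(1.0 - p))


def two_part_costs(k: int, n: int, max_precision: int = 12):
    """Yield (d, p, model_bits, data_bits) for the best p at each precision d.

    The parameter lives on the grid (j + 0.5) / 2**d, so naming a grid point
    costs d bits (an assumed, deliberately simple model code).
    """
    for d in range(max_precision + 1):
        grid = [(j + 0.5) / 2 ** d for j in range(2 ** d)]
        p = min(grid, key=lambda q: data_cost_bits(k, n, q))
        yield d, p, float(d), data_cost_bits(k, n, p)


def best_two_part(k: int, n: int):
    """Return the (d, p, model_bits, data_bits) tuple of least total length."""
    return min(two_part_costs(k, n), key=lambda t: t[2] + t[3])


d, p, model_bits, data_bits = best_two_part(k=30, n=100)
print(d, p, round(model_bits + data_bits, 2))
```

With precision $d = 0$ the only model is the fair coin and the data cost exactly $n$ bits; very fine grids fit the observed frequency better but pay more for the model, so the minimum total is reached at an intermediate precision.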

To derive the MDL approach we start from Bayes's rule written as

\[ \Pr(H \mid D) = \frac{\Pr(D \mid H)\, P(H)}{\Pr(D)}. \tag{1} \]

If the hypothesis space $\mathcal{H}$ is countable and the hypotheses $H$ are exhaustive and mutually exclusive,
then $\sum_{H \in \mathcal{H}} P(H) = 1$ and $\Pr(D) = \sum_{H \in \mathcal{H}} \Pr(D \mid H) P(H)$. For clarity and because it is relevant
for the sequel we distinguish notationally between the given prior probability "$P(\cdot)$" and the
probabilities "$\Pr(\cdot)$" that are induced by $P(\cdot)$ and the hypotheses $H$. Bayes's rule maps input
$(P(H), D)$ to output $\Pr(H \mid D)$, the posterior probability. For many model classes (Bernoulli
processes, Markov chains), as the number $n$ of data generated by a true model in the class increases,
the total inferred probability can be expected to concentrate on the "true" hypothesis (with
probability one for $n \to \infty$). That is, as $n$ grows the weight of the factor $\Pr(D \mid H)/\Pr(D)$ dominates
the influence of the prior $P(\cdot)$ for typical data, by the law of large numbers. The importance of
Bayes's rule is that the inferred probability gives us as much information as possible about the
possible hypotheses from only a small number of (typical) data and the prior probability.

In general we don't know the prior probabilities. The MDL approach in a sense replaces the
unknown prior probability that depends on the phenomenon being investigated by a fixed probability
that depends on the coding used to encode the hypotheses. In ideal MDL the fixed "universal"
probability (Appendix ??) is based on Kolmogorov complexity, the length of the shortest effective
code (Appendix A).

In Bayes's rule we are concerned with maximizing the term $\Pr(H \mid D)$ over $H$. Taking the
negative logarithm of both sides of the equation, this is equivalent to minimizing the expression
$-\log \Pr(H \mid D)$ over $H$:

\[ -\log \Pr(H \mid D) = -\log \Pr(D \mid H) - \log P(H) + \log \Pr(D). \]

Since the probability $\Pr(D)$ is constant under varying $H$, we want to find an $H_0$ such that

\[ H_0 := \operatorname{minarg}_{H \in \mathcal{H}} \{ -\log \Pr(D \mid H) - \log P(H) \}. \tag{2} \]
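The equivalence between maximizing the posterior and minimizing the sum of negative logarithms can be checked numerically on a toy class. The sketch below assumes three hypothetical coin biases with a made-up prior; none of these numbers come from the text.

```python
import math

# Hypothetical toy class: three coin biases with an assumed prior P(H).
prior = {0.2: 0.5, 0.5: 0.3, 0.8: 0.2}


def likelihood(bias: float, heads: int, tosses: int) -> float:
    """Pr(D | H) for one particular sequence with the given number of heads."""
    return bias ** heads * (1.0 - bias) ** (tosses - heads)


def map_hypothesis(heads: int, tosses: int) -> float:
    """Maximize the posterior Pr(H | D) obtained from Bayes's rule."""
    evidence = sum(likelihood(h, heads, tosses) * p for h, p in prior.items())
    return max(prior, key=lambda h: likelihood(h, heads, tosses) * prior[h] / evidence)


def min_code_length_hypothesis(heads: int, tosses: int) -> float:
    """Minimize -log2 Pr(D | H) - log2 P(H), the objective in (2)."""
    return min(
        prior,
        key=lambda h: -math.log2(likelihood(h, heads, tosses)) - math.log2(prior[h]),
    )


# The two selection rules agree for every possible head count in 20 tosses.
agree = all(
    map_hypothesis(k, 20) == min_code_length_hypothesis(k, 20) for k in range(21)
)
print(agree)
```

Dividing by the evidence $\Pr(D)$ does not change which hypothesis wins, which is why that term drops out of (2).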

In MML as in [?] or MDL as in [22] one roughly interprets these negative logarithms of probabilities
as the corresponding Shannon-Fano code word lengths. But why not use the shortest effective
descriptions with code word length set equal to the Kolmogorov complexities? This has an expected
code word length about equal to the entropy, but additionally it compresses each object
by effectively squeezing out and accounting for all regularities in it. The resulting code word is
maximally random, that is, it has maximal Kolmogorov complexity.

Footnote: The term $-\log \Pr(D \mid H)$ is also known as the self-information in information theory and the
negative log-likelihood in statistics. It can now be regarded as the number of bits it takes to redescribe
or encode $D$ with an ideal code relative to $H$. For the Shannon-Fano code see Section 2.3.

Under certain constraints to be determined later, the probabilities involved in (2) can be
substituted by the corresponding universal probabilities $\mathbf{m}(\cdot)$ (Appendix ??):

\[ \log P(H) := \log \mathbf{m}(H), \qquad \log \Pr(D \mid H) := \log \mathbf{m}(D \mid H). \tag{3} \]

According to [16, ?, ?] we can then substitute

\[ -\log \mathbf{m}(H) = K(H), \qquad -\log \mathbf{m}(D \mid H) = K(D \mid H), \tag{4} \]

where $K(\cdot)$ is the prefix complexity of Appendix A. This way we replace the sum of (2) by the
sum of the minimum lengths of effective self-delimiting programs that compute descriptions of $H$
and $D \mid H$. The result is the code-independent, recursively invariant, absolute form of the MDL
principle:



Definition 2. Given a hypothesis class $\mathcal{H}$ and a data sample $D$, the ideal MDL principle selects
the hypothesis

\[ H_0 := \operatorname{minarg}_{H \in \mathcal{H}} \{ K(D \mid H) + K(H) \}. \tag{5} \]

If there is more than one $H$ that minimizes (5) then we break the tie by selecting the one of least
complexity $K(H)$.

The key question of Bayesianism versus ideal MDL is: When is the substitution (3) valid?
We show that in a simple setting where the hypotheses are finite sets the ideal MDL principle and
Bayesianism using the universal prior $\mathbf{m}(x)$ coincide with each other and with the Kolmogorov
minimal sufficient statistic. We generalize this to probabilistic hypothesis classes. In full generality,
however, ideal MDL and Bayesianism may diverge due to the distinction between $-\log P(\cdot)$
(the Shannon-Fano code length) and the Kolmogorov complexity $K(\cdot)$ (the shortest effective code
length). We establish the Fundamental Inequality defining the range of coincidence of the two
principles.

From now on, we will denote by $\stackrel{+}{<}$ an inequality to within an additive constant, and by $\stackrel{+}{=}$ the
situation when both $\stackrel{+}{<}$ and $\stackrel{+}{>}$ hold.



2.1 Kolmogorov Minimal Sufficient Statistic 

Considering only hypotheses that are finite sets of binary strings of finite lengths, the hypothesis
selection principle known as "Kolmogorov's minimal sufficient statistic" [13] has a crisp
formulation in terms of Kolmogorov complexity. For this restricted hypothesis class we show that the
Kolmogorov minimal sufficient statistic is actually Bayesian hypothesis selection using the universal
distribution $\mathbf{m}(\cdot)$ as prior distribution and it also coincides with the ideal MDL principle.

Footnote: The relation between the Shannon-Fano code and Kolmogorov complexity is treated in Section 2.3.
For clarity of treatment, we refer the reader to the Appendices or [?] for all definitions and analysis of
auxiliary notions. This way we also do not deviate from the main argument, do not obstruct the
knowledgeable reader, and do not confuse or discourage the reader who is unfamiliar with Kolmogorov
complexity theory. The bulk of the material is Appendix ?? on Martin-Löf's theory of randomness tests.
In particular the explicit expressions of universal randomness tests for arbitrary recursive distributions
are unpublished apart from [?] and partially in [?].

We follow the treatment of [17] using prefix complexity instead of plain complexity. Let $k$
and $\delta$ be natural numbers. A binary string $D$ representing a data sample is called $(k, \delta)$-stochastic
if there is a finite set $H \subseteq \{0,1\}^*$ such that

\[ D \in H, \quad K(H) \le k, \quad K(D \mid H) \ge \log d(H) - \delta. \]

The first inequality (with $k$ not too large) means that $H$ is sufficiently simple. The second inequality
(with the randomness deficiency $\delta$ not too large) means that $D$ is an undistinguished (typical)
element of $H$. Indeed, if $D$ had properties defining a very small subset $H'$ of $H$, then these could
be used to obtain a simple description of $D$ by determining its ordinal number in $H'$, which would
require $\log d(H')$ bits, which is much less than $\log d(H)$.

Suppose we carry out some probabilistic experiment of which the outcome can be a priori any
natural number. Suppose this number is $D$. Knowing $D$, we want to recover the probability
distribution $P$ on the set of natural numbers $\mathcal{N}$. It seems reasonable to require that first, $P$ has
a simple description, and second, that $D$ would be a "typical" outcome of an experiment with
probability distribution $P$; that is, $D$ is maximally random with respect to $P$. The analysis above
addresses a simplified form of this issue where $H$ plays the part of $P$; for example, $H$ is a finite
set of high-probability elements. In $n$ tosses of a coin with probability $p > 0$ of coming up "heads,"
the set $H$ of outcomes consisting of binary strings of length $n$ with $n/2$ 1's constitutes a set of
cardinality $\binom{n}{n/2} = \Theta(2^n/\sqrt{n})$. To describe an element $D \in H$ requires $\stackrel{+}{<} n - \frac{1}{2}\log n$ bits. To
describe $H \subseteq \{0,1\}^n$ given $n$ requires $O(1)$ bits (that is, $k$ is small in (6) below). Conditioning
everything on the length $n$, we have

\[ K(D \mid n) \stackrel{+}{<} K(D \mid H, n) + K(H \mid n) \stackrel{+}{<} n - \frac{1}{2}\log n, \]

and for the overwhelming majority of the $D$'s in $H$,

\[ K(D \mid n) \stackrel{+}{>} n - \frac{1}{2}\log n. \]

These latter $D$'s are $(O(1), O(1))$-stochastic.
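The cardinality estimate used here is easy to check numerically. The sketch below, in Python with a hypothetical helper name, compares $\log \binom{n}{n/2}$ with $n - \frac{1}{2}\log n$ (logarithms base 2) and shows that the gap stays bounded as $n$ grows, which is all that an estimate to within an additive constant requires.

```python
import math


def log2_middle_layer(n: int) -> float:
    """log2 of the number of n-bit strings with exactly n/2 ones (n even)."""
    return math.log2(math.comb(n, n // 2))


gaps = {}
for n in (16, 256, 4096):
    exact = log2_middle_layer(n)
    approx = n - 0.5 * math.log2(n)
    # By Stirling's formula the gap tends to log2(sqrt(2/pi)), about -0.33.
    gaps[n] = exact - approx
    print(n, round(exact, 3), round(approx, 3), round(gaps[n], 3))
```

So indexing a string inside the set of balanced strings saves about $\frac{1}{2}\log n$ bits over the literal $n$-bit description, up to a constant.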

The Kolmogorov structure function $K_k(D \mid n)$ of $D \in \{0,1\}^n$ is defined by

\[ K_k(D \mid n) = \min\{\log d(H) : D \in H,\ K(H \mid n) \le k\}. \]

For a given small constant $c$, let $k_0$ be the least $k$ such that

\[ K_k(D \mid n) + k \le K(D \mid n) + c. \tag{6} \]

Let $H_0$ be the corresponding set, and let $H_0^*$ be its shortest program. This $k_0$ with $K(H_0 \mid n) \le k_0$
is the least $k$ for which the two-part description of $D$ is as parsimonious as the best single part
description of $D$.

For this approach to be meaningful we need to show that there always exists a $k$ satisfying (6).
For example, consider the hypothesis $H_D := \{D\}$. Then $\log d(H_D) = 0$ and $K(H_D \mid n) \stackrel{+}{=} K(D \mid n)$,
which shows that setting $k := K(D \mid n)$ satisfies (6) since $K_k(D \mid n) = 0$.
If $D$ is maximally complex in some set $H_0$, then $H_0$ represents all non-accidental structure
in $D$. The $K_{k_0}(D \mid n)$-stage of the description just provides an index for $D$ in $H_0$, essentially the
description of the randomness or accidental structure of the string.

Call a program $H^*$ a sufficient statistic if the complexity $K(D \mid H, n)$ attains its maximum value
$\stackrel{+}{=} \log d(H)$ and the randomness deficiency $\delta(D \mid H) = \log d(H) - K(D \mid H, n)$ is minimal.

Definition 3. Let $\mathcal{H} := \{H : H \subseteq \{0,1\}^n\}$ and $D \in \{0,1\}^n$. Define

\[ H_0 := \operatorname{minarg}_{H \in \mathcal{H}} \{ K(H \mid n) : K(H \mid n) + \log d(H) \stackrel{+}{=} K(D \mid n) \}. \tag{7} \]

The set $H_0$, or rather the shortest program $H_0^*$ that prints out the characteristic sequence of
$H_0 \subseteq \{0,1\}^n$, is called the Kolmogorov minimal sufficient statistic (KMSS) for $D$, given $n$.

All programs describing sets $H$ with $K(H \mid n) \le k_0$ such that $K_{k_0}(D \mid n) + k_0 \stackrel{+}{=} K(D \mid n)$ are
sufficient statistics. But the "minimal" sufficient statistic is induced by the set $H_0$ having the
shortest description among them. Let us now tie the Kolmogorov minimal sufficient statistic to
Bayesian inference and ideal MDL.

Theorem 1. Let $n$ be a large enough positive integer. Consider the hypothesis class
$\mathcal{H} := \{H : H \subseteq \{0,1\}^n\}$ and a data sample $D \in \{0,1\}^n$. All of the following principles select
the same hypothesis:

(i) Bayes's rule to select the least complexity hypothesis among the hypotheses of maximal a
posteriori probability using both (a) the universal distribution $\mathbf{m}(\cdot)$ as prior distribution, and (b)
$\Pr(D \mid H)$ is the uniform probability $1/d(H)$ for $D \in H$ and $0$ otherwise;

(ii) Kolmogorov minimal sufficient statistic; and

(iii) ideal MDL.

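Proof. (i) $\leftrightarrow$ (ii). Substitute probabilities as in the statement of the theorem in (2).

(ii) $\leftrightarrow$ (iii). Let $H_0$ be the Kolmogorov minimal sufficient statistic for $D$ so that (7) holds. In
addition, it is straightforward that $K(H_0 \mid n) + K(D \mid H_0, n) \stackrel{+}{>} K(D \mid n)$, and $K(D \mid H_0, n) \stackrel{+}{<} \log d(H_0)$
because we can describe $D$ by its index in the set $H_0$. Altogether it follows that
$K(D \mid H_0, n) \stackrel{+}{=} \log d(H_0)$ and

\[ K(H_0 \mid n) + K(D \mid H_0, n) \stackrel{+}{=} K(D \mid n). \]

Since $K(H \mid n) + K(D \mid H, n) \stackrel{+}{>} K(D \mid n)$ for all $H \in \mathcal{H}$, if

\[ \mathcal{H}_0 = \{ H' : H' = \operatorname{minarg}_{H \in \mathcal{H}} \{ K(H \mid n) + K(D \mid H, n) \} \} \]

then

\[ H_0 = \operatorname{minarg}_{H} \{ K(H) : H \in \mathcal{H}_0 \}, \]

which is what we had to prove. □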

Example 1. Let us look at a coin toss example. If the probability $p$ of tossing "1" is unknown,
then we can give a two-part description of a string $D$ representing the sequence of $n$ outcomes
by describing the number $k$ of 1's in $D$ first, followed by the index $j \le d(H)$ of $D$ in the
set $H$ of strings with $k$ 1's. In this way "$k \mid n$" functions as the model. If $k$ is incompressible
with $K(k \mid n) \stackrel{+}{=} \log n$ and $K(j \mid k, n) \stackrel{+}{=} \log \binom{n}{k}$ then the Kolmogorov minimal sufficient statistic is
described by $k \mid n$ in $\log n$ bits. However, if $p$ is a simple value like $\frac{1}{2}$ (or $1/\pi$), then with overwhelming
probability we obtain a much simpler Kolmogorov minimal sufficient statistic by a description
of $p = \frac{1}{2}$ and $k = \frac{n}{2} + O(\sqrt{n})$ so that $K(k \mid n) \stackrel{+}{<} \frac{1}{2}\log n$. ◇
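The two-part description in this example (first the number of 1's, then the index of the string among all strings with that many 1's) can be sketched directly. The code below is an illustration under stated assumptions: it charges $\log(n+1)$ bits for $k$ and $\log\binom{n}{k}$ bits for the index, uses lexicographic order for the index, and the function names are hypothetical. It computes code lengths of this particular code, not Kolmogorov complexities.

```python
import math


def two_part_length_bits(bits: str) -> float:
    """Bits to name k (the number of 1's) plus the index of the string among
    all strings of the same length with exactly k ones."""
    n, k = len(bits), bits.count("1")
    model_bits = math.log2(n + 1)            # k ranges over 0..n
    index_bits = math.log2(math.comb(n, k))  # position inside the set H
    return model_bits + index_bits


def index_in_set(bits: str) -> int:
    """Lexicographic rank of `bits` among strings with the same length and
    the same number of ones (combinatorial number system)."""
    rank, ones_left = 0, bits.count("1")
    for pos, symbol in enumerate(bits):
        remaining = len(bits) - pos - 1
        if symbol == "1":
            # Strings with '0' here and ones_left ones later all come first.
            rank += math.comb(remaining, ones_left)
            ones_left -= 1
    return rank


sparse = "0" * 99 + "1"      # highly regular: a single 1
balanced = "01" * 50         # 50 ones among 100 symbols
print(round(two_part_length_bits(sparse), 2), index_in_set(sparse))
print(round(two_part_length_bits(balanced), 2))
```

For the sparse string the two-part code is far shorter than the literal 100 bits, while for a string with 50 ones the set $H$ is so large that the index alone costs nearly 100 bits.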






2.2 Probabilistic Generalization of KMSS 



Comparison of (2), (7), and Theorem 1 suggests a more general probabilistic version of Kolmogorov
minimal sufficient statistic. This version turns out to coincide with maximum a posteriori Bayesian
inference but not necessarily with ideal MDL without additional conditions.

Definition 4. Let $\mathcal{H}$ be an enumerable class of probabilistic hypotheses and $\mathcal{D}$ be an enumerable
domain of data samples such that for every $H \in \mathcal{H}$ the probability density function $\Pr(\cdot \mid H)$ over
the domain of data samples is recursive. Assume furthermore that for every data sample $D$ in the
domain there is an $H_D \in \mathcal{H}$ such that $\Pr(D \mid H_D) = 1$ and $K(D \mid H_D) \stackrel{+}{=} 0$ (the hypothesis forces the
data sample). Define

\[ H_0 := \operatorname{minarg}_{H \in \mathcal{H}} \{ K(H) : K(H) - \log \Pr(D \mid H) \stackrel{+}{=} K(D) \}. \tag{8} \]

The set $H_0$, or rather the shortest program $H_0^*$ that prints out the characteristic sequence of
$H_0 \in \{0,1\}^n$, is called the generalized Kolmogorov minimal sufficient statistic (GKMSS) for $D$.

The requirement that for every data sample in the domain there is a hypothesis that forces it
ensures that $H_0$ as in Definition 4 exists.

Theorem 2. The least complexity maximum a posteriori probability hypothesis $H_0$ in Bayes's rule
using prior $P(H) := \mathbf{m}(H)$ coincides with the generalized Kolmogorov minimal sufficient statistic.

Proof. Substitute $P(x) := \mathbf{m}(x)$ in (2). Using (4) the least complexity hypothesis satisfying
the optimization problem is:

\[ H_0 := \operatorname{minarg}_{H'} \{ K(H') : H' := \operatorname{minarg}_{H \in \mathcal{H}} \{ K(H) - \log \Pr(D \mid H) \} \}. \tag{9} \]

By assumptions in Definition 4 there is an $H_D$ such that $K(H_D) - \log \Pr(D \mid H_D) \stackrel{+}{=} K(D)$. It
remains to show that $K(H) - \log \Pr(D \mid H) \stackrel{+}{>} K(D)$ for all $H, D$.

It is straightforward that $K(H) + K(D \mid H) \stackrel{+}{>} K(D)$. For recursive $\Pr(\cdot \mid \cdot)$ it holds that
$l_{\Pr(\cdot \mid H)}(D) = -\log \Pr(D \mid H)$ is the code length of the effective Shannon-Fano prefix code (see
Section 2.3 or [?, 13]) to recover $D$ given $H$. Since the prefix complexity is the length of the shortest
effective prefix code we have $-\log \Pr(D \mid H) \stackrel{+}{>} K(D \mid H)$. □



2.3 Shannon-Fano Code, Shortest Programs, and Randomness 

There is a tight connection between prefix codes, probabilities, and notions of optimal codes. The
Shannon-Fano prefix code for an ensemble of source words with probability density $q$ has code
word length $l_q(x) := -\log q(x)$ (up to rounding) for source word $x$. This code satisfies

$$H(q) \leq \sum_x q(x) l_q(x) \leq H(q) + 1,$$



The equivalent hypothesis for a data sample $D$ in the setting of the Kolmogorov minimal sufficient statistic was
$H_D = \{D\}$.

The Kolmogorov minimal sufficient statistic of Section 2.1 is the special case of the generalized version for
hypotheses that are finite sets and for which $\Pr(D|H)$ is the uniform probability.
where $H(q)$ is the entropy of $q$. By the Noiseless Coding Theorem this is the least expected code
word length among all prefix codes. Therefore, the hypothesis $H$ which minimizes the negative
logarithm of the posterior probability in Bayes's rule, written as

$$l_{\Pr(\cdot|H)}(D) + l_P(H) = -\log \Pr(D|H) - \log P(H),$$

minimizes the sum of two prefix codes that both have shortest expected code-word lengths. This is
more or less what MML and MDL [22] do.
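As a small numerical illustration of the Shannon-Fano bound (a sketch, not part of the original argument), the following Python snippet computes the code word lengths $\lceil -\log q(x) \rceil$ for a hypothetical four-symbol source and checks that the expected length lies between $H(q)$ and $H(q) + 1$. It also checks the Kraft inequality, which guarantees that a prefix code with these lengths exists. The distribution `q` and all function names are assumptions chosen for the example.

```python
import math

def shannon_fano_lengths(q):
    """Code word lengths ceil(-log2 q(x)) for every source word x with q(x) > 0."""
    return {x: math.ceil(-math.log2(p)) for x, p in q.items() if p > 0}

def entropy(q):
    """Shannon entropy H(q) in bits."""
    return -sum(p * math.log2(p) for p in q.values() if p > 0)

def expected_length(q, lengths):
    """Expected code word length sum_x q(x) * l_q(x)."""
    return sum(q[x] * lengths[x] for x in lengths)

# Hypothetical source distribution (an assumption for illustration only).
q = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

lengths = shannon_fano_lengths(q)
H = entropy(q)
L = expected_length(q, lengths)
kraft = sum(2.0 ** -l for l in lengths.values())  # <= 1 means a prefix code exists

print(lengths, H, L, kraft)
```

Rounding up to whole bits is what produces the slack of at most one bit over the entropy.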

But there are many prefix codes that have expected code word length almost equal to the
entropy. Consider only the class of prefix codes that can be decoded by Turing machines (other
codes are not practical). There is an optimal code in that class with code word length $K(x)$ for
object $x$. "Optimality" means that for every prefix code in the class there is a constant $c$ such that
for all $x$ the length of the code for $x$ is at least $K(x) - c$; see the Appendix or [17].

In ideal MDL we minimize the sum of the effective description lengths of the individual elements
$H, D$ involved as in (5). This is validated by Bayes's rule provided that $-\log P(H) = K(H)$ and
$-\log \Pr(D|H) = K(D|H)$ hold. To satisfy the first condition we are free to make the new assumption
that the prior probability $P(\cdot)$ in Bayes's rule (1) is fixed as $\mathbf{m}(\cdot)$. However, with respect to the
second condition we cannot assume that the probability $\Pr(\cdot|H)$ equals $\mathbf{m}(\cdot|H)$. Namely, the
probability $\Pr(\cdot|H)$ may be totally determined by the hypothesis $H$. Depending on $H$, therefore,
$l_{\Pr(\cdot|H)}(D)$ may be very different from $K(D|H)$. This holds especially for 'simple' data $D$ which
have low probability under the assumption of hypothesis $H$.

Example 2 Suppose we flip a coin of unknown bias $n$ times. Let hypothesis $H$ and data $D$ be
defined by:

$$H := [\text{Probability of 'head' is } 1/2]$$
$$D := \underbrace{hh \ldots h}_{n \text{ times 'h'(ead)s}}$$

Then we have $\Pr(D|H) = 1/2^n$ and

$$l_{\Pr(\cdot|H)}(D) = -\log \Pr(D|H) = n.$$

In contrast,

$$K(D|H) \leq \log n + 2 \log \log n.$$

O 
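To make the size of this gap concrete, the sketch below tabulates the Shannon-Fano cost $-\log \Pr(D|H) = n$ next to the bound $\log n + 2 \log \log n$ on $K(D|H)$ quoted in Example 2. The function names are hypothetical, logarithms are taken base 2, and the additive constants hidden in the complexity bound are ignored. $K$ itself is not computable, so only the stated bound is evaluated.

```python
import math

def shannon_fano_cost_all_heads(n, p_head=0.5):
    """-log2 Pr(D|H) for D = n heads under a coin with Pr(head) = p_head."""
    return -n * math.log2(p_head)

def complexity_upper_bound(n):
    """The bound log n + 2 log log n on K(D|H) from Example 2 (constants ignored)."""
    return math.log2(n) + 2 * math.log2(math.log2(n))

for n in (16, 1024, 2 ** 20):
    print(n, shannon_fano_cost_all_heads(n), complexity_upper_bound(n))
```

The first column grows linearly in $n$ while the second grows only logarithmically, which is exactly the sense in which all-heads data are atypical for a fair coin.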

The question arises: exactly when is $-\log P(x) = K(x)$? This is answered by the theory of individual
randomness. Let $P : \{0,1\}^* \to [0,1]$ be a recursive probability density function. By Theorem 11
(Appendix) an element $x$ is Martin-Löf random iff it satisfies the universal test
$\log(\mathbf{m}(x)/P(x)) \leq 0$. That is, $-\log \mathbf{m}(x) \geq -\log P(x)$.



A real-valued function is recursive if there is a Turing machine that for every argument and precision parameter
$b$ computes the function value within precision $2^{-b}$ and halts.

This means that every $x$ is random with respect to the universal distribution $\mathbf{m}(x)$ (substitute
$P(x) := \mathbf{m}(x)$ above).
2.4 The Fundamental Inequality

Let us call an hypothesis class $\mathcal{H}$ rich enough if it contains a trivial hypothesis $H_0$ satisfying
$K(H_0) = 0$ and satisfies the assumptions of Definition 4. In ideal MDL as applied to such a rich
enough hypothesis class $\mathcal{H}$ there are two boundary cases: The trivial hypothesis $H_0$ with $K(H_0) = 0$
always implies that $K(D|H_0) = K(D)$ and therefore $K(H_0) + K(D|H_0) = K(D)$. The hypothesis
$H_D$ of Definition 4 also yields $K(H_D) + K(D|H_D) = K(D)$. Since always $K(H) + K(D|H) \geq K(D)$,
these hypotheses minimize the ideal MDL description.

But for trivial hypotheses only Kolmogorov random data are typical. In fact, ideal MDL
correctly selects the trivial hypothesis for individually random data. But in general "meaningful"
data are "nonrandom" in the sense that $K(D) \ll l(D)$. But then $D$ is typical only for nontrivial
hypotheses, and a trivial hypothesis selected by ideal MDL is not one for which the data are
typical. We need to identify the conditions under which ideal MDL restricts itself to selection
among hypotheses for which the given data are typical, that is, under which it performs as the
generalized Kolmogorov minimal sufficient statistic.

Note that hypotheses satisfying (8) may not always exist if we do not require that every data
sample in the domain is forced by some hypothesis in the hypothesis space we consider, as we did
in Definition 4.

Example 3 We look at a situation where the three optimization principles (Bayes's rule, the
generalized Kolmogorov minimal sufficient statistic, and ideal MDL) act differently. Consider the
outcome of $n$ trials of a Bernoulli process $(p, 1-p)$. There are two hypotheses
$\mathcal{H} = \{H_0, H_1\}$ where

$$H_0 = [p = 1/2], \qquad H_1 = [p = 0].$$

The prior $P$ is $P(H_0) = 1/2$ and $P(H_1) = 1/2$. Consider the data sample $D = 0^n$ with $n$ Kolmogorov
random (also with respect to $H_0$ and $H_1$) so that $\log n \leq K(D), K(D|H_0), K(D|H_1) \leq \log n +
2 \log \log n$. Now, up to additive constants,

$$-\log P(H_0) - \log \Pr(D|H_0) = n,$$
$$-\log P(H_1) - \log \Pr(D|H_1) = 0.$$

Therefore Bayesianism selects $H_1$, which is intuitively correct. Both hypotheses have complexity
0. Therefore we can substitute $-\log P(H) := K(H)$ to obtain

$$K(H_0) - \log \Pr(D|H_0) = n,$$
$$K(H_1) - \log \Pr(D|H_1) = 0.$$

Now the generalized Kolmogorov minimal sufficient statistic does not select any hypothesis at all
because the right-hand side is unequal to $K(D)$. Ideal MDL on the other hand has the ex aequo choice

$$K(H_0) + K(D|H_0) = \log n + O(\log \log n),$$
$$K(H_1) + K(D|H_1) = \log n + O(\log \log n),$$

which intuitively seems incorrect. So we need to identify the conditions under which ideal MDL
draws correct conclusions. O
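The Bayesian half of Example 3 can be checked numerically. The sketch below assumes the reading $H_0 = [p = 1/2]$, $H_1 = [p = 0]$ (with $p$ the probability of outcome 1) and a prior of $1/2$ on each hypothesis. These values and all names are assumptions made for illustration. It computes the exact two-part Bayesian cost $-\log P(H) - \log \Pr(D|H)$ for $D = 0^n$, which differs from the values $n$ and $0$ in the example only by the additive constant $-\log(1/2) = 1$.

```python
import math

def neg_log_likelihood_zeros(n, p_one):
    """-log2 Pr(0^n | p), where p_one is the probability of outcome 1."""
    p_zero = 1.0 - p_one
    if p_zero == 0.0:
        return math.inf  # the hypothesis contradicts the data
    return -n * math.log2(p_zero)

def bayes_cost(n, p_one, prior):
    """-log2 P(H) - log2 Pr(D|H) for D = 0^n."""
    return -math.log2(prior) + neg_log_likelihood_zeros(n, p_one)

n = 1000
hypotheses = {"H0": 0.5, "H1": 0.0}   # assumed values of p for each hypothesis
prior = {"H0": 0.5, "H1": 0.5}        # assumed prior probabilities

costs = {h: bayes_cost(n, p, prior[h]) for h, p in hypotheses.items()}
map_hypothesis = min(costs, key=costs.get)
print(costs, map_hypothesis)
```

The ideal MDL costs $K(H) + K(D|H)$ cannot be computed this way, since $K$ is not computable. The example's point is that both are about $\log n$, so ideal MDL cannot separate the two hypotheses although Bayes's rule can.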
While we can set the prior $P(\cdot) := \mathbf{m}(\cdot)$ in Bayes's rule to obtain the generalized Kolmogorov
minimal sufficient statistic for a rich enough hypothesis class, we cannot in general also set
$-\log \Pr(D|H) = K(D|H)$ to obtain ideal MDL. The theory dealing with randomness of individual objects
states the conditions for $-\log \Pr(D|H)$ and $K(D|H)$ to be close:

• Data sample $D$ is "typical" for $H \in \mathcal{H}$ iff $-\log \Pr(D|H) \leq K(D|H)$; and

• Data sample $D$ is "not typical" for $H$ iff $-\log \Pr(D|H) \gg K(D|H)$.

Below we need the following:

Definition 5 Let $P : \mathcal{N} \to [0,1]$ be a recursive probability density distribution. Then the
prefix complexity $K(P)$ of $P$ is defined as the length of the shortest self-delimiting program for
the reference universal prefix machine to simulate the Turing machine computing the probability
density function $P$: it is the shortest effective self-delimiting description of $P$ (see the Appendix).

The relation between Bayes's rule and ideal MDL is governed by the following:

Theorem 3 (Fundamental Inequality) Let $\Pr(\cdot|\cdot)$ and $P(\cdot)$ be recursive probability density
functions.

(i) If $D$ is $\Pr(\cdot|H)$-random and $H$ is $P(\cdot)$-random then

$$K(D|H) + K(H) - \alpha(P,H) \leq -\log \Pr(D|H) - \log P(H) \leq K(D|H) + K(H), \tag{10}$$

with

$$\alpha(P,H) = K(\Pr(\cdot|H)) + K(P).$$

(ii) If

$$-\log \Pr(D|H) - \log P(H) = K(D|H) + K(H),$$

then $D$ is $\Pr(\cdot|H)$-random and $H$ is $P(\cdot)$-random.

Proof. (i)

Claim 1 If $D$ is a Martin-Löf random element of the distribution $\Pr(\cdot|H)$ then

$$K(D|H) - K(\Pr(\cdot|H)) \leq -\log \Pr(D|H) \leq K(D|H). \tag{11}$$

Proof. We appeal to the following known facts (see the Appendix). Because $\Pr(\cdot|H)$ is recursive,
$\mathbf{m}(D|H) \geq 2^{-K(\Pr(\cdot|H))} \Pr(D|H)$. Therefore,

$$\log(\mathbf{m}(D|H)/\Pr(D|H)) \geq -K(\Pr(\cdot|H)). \tag{12}$$

Note that $K(\Pr(\cdot|H)) \leq K(H)$ because from $H$ we can compute $\Pr(\cdot|H)$ by the assumption on $\Pr(\cdot|\cdot)$.
Secondly, if $D$ is a Martin-Löf random element of the distribution $\Pr(\cdot|H)$, then by Theorem 11:

$$\log(\mathbf{m}(D|H)/\Pr(D|H)) \leq 0. \tag{13}$$



■'This follows from (M). 
For $D$'s that are Martin-Löf random, (12) and (13) mean by the Coding Theorem,
$K(D|H) = -\log \mathbf{m}(D|H)$, that (11) holds.

□

If we set the a priori probability $P(H)$ of hypothesis $H$ to the universal probability then we
obtain directly $-\log P(H) = K(H)$. However, we do not need to make this assumption. For a
recursive prior we can analyze the situation when $H$ is random with respect to $P(\cdot)$.

Claim 2 If $H$ is a Martin-Löf random element of $P(\cdot)$ then

$$K(H) - K(P) \leq -\log P(H) \leq K(H). \tag{14}$$

Proof. Analogous to the proof of (11). □

Together (11) and (14) yield the theorem, part (i).
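Spelled out, part (i) is the term-by-term sum of the two claims. The following display is only a restatement of Claim 1 and Claim 2 with the bookkeeping for $\alpha(P,H)$ made explicit:

```latex
\begin{align*}
K(D|H) - K(\Pr(\cdot|H)) &\le -\log \Pr(D|H) \le K(D|H)
  && \text{(Claim 1)} \\
K(H) - K(P) &\le -\log P(H) \le K(H)
  && \text{(Claim 2)} \\
\Longrightarrow\quad
K(D|H) + K(H) - \alpha(P,H) &\le -\log \Pr(D|H) - \log P(H) \le K(D|H) + K(H),
  && \alpha(P,H) = K(\Pr(\cdot|H)) + K(P).
\end{align*}
```

The slack $\alpha(P,H)$ on the left is the sum of the two lower-bound corrections, which is why the inequality is tight only when both the prior and the conditional distribution have short descriptions.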

(ii) Follows from (|). □ 

Remark 1 We would like the theorem to hold for the overwhelming majority of (data, hypothesis)
pairs. It can be shown that a fixed fraction of all objects of every length are random according to
the universal randomness test for finite objects (Theorem 11, Appendix). In the sample space
of infinite binary sequences a related randomness test shows infinite binary sequences are random
with probability one (Lemma 1, Section 3). We would like to state that almost all finite objects are
random for all recursive distributions. Although this is the case for many probability distributions
(for example $\mathbf{m}(\cdot)$) we can ensure it for all probability distributions by relaxing the randomness
condition.

Clearly, for finite sequences randomness viewed as absence of regularities is a matter of degree:
it doesn't make sense to say that $x$ is random and $x$ with the first bit flipped is non-random.
Relaxing the arbitrary dividing line we call finite $x$'s of length $n$, say, weakly random if
$-\log \mathbf{m}(x) > -\log P(x) - \log n$. This is more in line with Martin-Löf's original universal
randomness test for finite binary strings, which is a little weaker than $\log(\mathbf{m}(x)/P(x)) \leq 0$
(see the Appendix). Now

$$\sum_x P(x) 2^{\log(\mathbf{m}(x)/P(x))} = \sum_x \mathbf{m}(x) \leq 1 - \epsilon$$

for some constant $\epsilon > 0$. The last inequality (Lemma 4.3.2 in [^) is essentially due to the halting
problem. By Markov's inequality (see the Appendix), with overwhelming $P$-probability ($> 1 - 1/n$) an
$x \in \{0,1\}^*$ is weakly random.

If we relax our notion of individual randomness to, say, weak randomness then at least a fraction
of $1 - 1/n$ of all binary strings of length $n$ is weakly random. This may cause an increase of $\alpha(P,H)$
by an additive logarithmic term (logarithmic in the length of $P$ and $H$). O



Remark 2 (Range of Validity of FI) That hypothesis $H$ is $P$-random means that $H$ is "typical"
for the prior distribution $P(\cdot)$ in the sense that it must not belong to any effective minority (sets
on which a minority of $P$-probability is concentrated). That is, hypothesis $H$ does not have any
effectively testable properties that distinguish it from a majority. In [0 it is shown that this is the
set of $H$'s such that $K(H) = -\log P(H)$. In case $P(H) = \mathbf{m}(H)$, that is, the prior distribution
equals the universal distribution, then for all $H$ we have $K(H) = -\log P(H)$, that is, all hypotheses
are random with respect to the universal distribution.

For other prior distributions some hypotheses are random, and some other hypotheses are
nonrandom. Let the possible hypotheses correspond to the binary strings of length $n$, and let $P_n$
be the uniform distribution that assigns probability $P_n(H) = 1/2^n$ to every hypothesis $H$. Let us
assume that the hypotheses are coded as binary strings of length $n$, so that $H \in \{0,1\}^n$. Then
$H := 00 \ldots 0$ has low complexity: $K(H|n) \leq \log n$. However, $-\log P_n(H) = n$. Therefore, by (14),
$H$ is not $P_n$-random. If we obtain $H$ by $n$ flips of a fair coin, however, then with overwhelming
probability we will have that $K(H|n) = n$ and therefore $-\log P_n(H) = K(H|n)$ and $H$ is
$P_n$-random.

That data sample $D$ is $\Pr(\cdot|H)$-random means that the data are random with respect to the
probability distribution $\Pr(\cdot|H)$ induced by the hypothesis $H$. Therefore, we require that the
sample data $D$ are 'typical', that is, 'randomly distributed' with respect to $\Pr(\cdot|H)$.

O 
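The contrast between the all-zero hypothesis and a coin-flip hypothesis under the uniform prior $P_n$ can be illustrated with an ordinary compressor. This is only a rough sketch: $K$ is not computable, and the zlib output length used below is merely a crude, computable upper-bound proxy for $K(H|n)$, chosen here as an assumption for illustration. The helper name `compressed_bits` and the value of `n` are hypothetical.

```python
import random
import zlib

def compressed_bits(bits):
    """Bit length of the zlib encoding of a 0/1 string.

    This is only an upper-bound stand-in for K(H|n); the true prefix
    complexity can be much smaller and is not computable.
    """
    packed = int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")
    return 8 * len(zlib.compress(packed, 9))

n = 8192
all_zeros = "0" * n                       # H := 00...0
rng = random.Random(0)                    # fixed seed so the run is repeatable
coin_flips = "".join(rng.choice("01") for _ in range(n))

neg_log_uniform = n                       # -log2 P_n(H) = n for every H of length n
zeros_bits = compressed_bits(all_zeros)
flips_bits = compressed_bits(coin_flips)

print(neg_log_uniform, zeros_bits, flips_bits)
```

For the all-zero string the proxy is far below $-\log P_n(H) = n$, matching the statement that this $H$ is not $P_n$-random. For the coin-flip string the proxy stays close to $n$, as expected for a typical element of the uniform distribution.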



Remark 3 (Optimal Hypothesis Doesn't Satisfy FI) The only way to violate the Fundamental
Inequality is that either $D$ is not $\Pr(\cdot|H)$-random and therefore $-\log \Pr(D|H) \gg K(D|H)$,
or that $H$ is not $P$-random and therefore $-\log P(H) \gg K(H)$. We give an example of the first
case.

Consider the identification of a Bernoulli process $B_p = (p, 1-p)$ ($0 \leq p \leq 1$) that generates
a given data sample $D \in \{0,1\}^n$. Let $\Pr(D|B_p, n)$ denote the distribution of the outcome $D$ of $n$
trials of the process $B_p$. If the data $D$ are "atypical" like $D = 00 \ldots 0$ ($n$ failures) for $p = 1/2$ and $n$
large, then it violates the $\Pr(\cdot|B_{1/2}, n)$-randomness test (13) by having $-\log \Pr(D|B_{1/2}) = n$ and

$$K(D|B_{1/2}) \leq \log n + 2 \log \log n.$$

Let the true hypothesis be $B_0$. The data sample $D = 00 \ldots 0$ is Martin-Löf random with respect
to $B_0$. In fact, for every $p$ the data sample generated by $B_p$ is with overwhelming likelihood
Martin-Löf random with respect to $B_p$. For such data samples, if furthermore the prior probability
$P(\cdot) := \mathbf{m}(\cdot)$, then the Fundamental Inequality holds. If in fact $B_{1/2}$ had been the true hypothesis
and we had obtained the same data $D = 00 \ldots 0$ ($n$ zeros) as before, then the Fundamental
Inequality is violated for this true hypothesis, and with an appropriate prior Bayes's rule selects
$B_{1/2}$ while MDL selects $B_0$. O

2.5 Ideal MDL and Bayesianism 

The best model or hypothesis to explain the data is one that is a "typical" element of the prior
distribution and for which the data are "typical", as prescribed by Kolmogorov's
minimum sufficient statistics. Thus, it is reasonable to contemplate only admissible
hypotheses as defined below in selecting the best one.

Definition 6 Given data sample $D$ and prior probability $P$, we call a hypothesis $H$ admissible
if $H$ is $P$-random and $D$ is $\Pr(\cdot|H)$-random (which implies that the Fundamental Inequality (10)
holds).



This follows from (||). 
Theorem 4 Let the data sample be $D$ and let the corresponding set of admissible hypotheses be
$\mathcal{H}_D \subseteq \mathcal{H}$. Then the maximum a posteriori probability hypothesis $H_{\rm bayes} \in \mathcal{H}_D$ in Bayes's rule and
the hypothesis $H_{\rm mdl} \in \mathcal{H}_D$ selected by ideal MDL are roughly equal:

$$\alpha(P,H) \geq K(D|H_{\rm mdl}) + K(H_{\rm mdl}) - K(D|H_{\rm bayes}) - K(H_{\rm bayes}) \geq 0. \tag{15}$$

Proof. If in the Fundamental Inequality $\alpha(P,H)$ is small then this means that both the
prior distribution $P$ is simple, and that the probability distribution $\Pr(\cdot|H)$ over the data samples
induced by hypothesis $H$ is simple. In contrast, if $\alpha(P,H)$ is large, which means that either of
the mentioned distributions is not simple, for example when $K(\Pr(\cdot|H)) = K(H)$ for complex $H$,
then there may be some discrepancy. Namely, in Bayes's rule our purpose is to maximize $\Pr(H|D)$,
and the hypothesis $H$ that minimizes $K(D|H) + K(H)$ also maximizes $\Pr(H|D)$ up to a $2^{-\alpha(P,H)}$
multiplicative factor. Conversely, the $H$ that maximizes $\Pr(H|D)$ also minimizes $K(D|H) + K(H)$
up to an additive term $\alpha(P,H)$. That is, with

$$H_{\rm mdl} := \mathrm{minarg}_{H'} \{K(H') : H' := \mathrm{minarg}_{H \in \mathcal{H}_D} \{K(D|H) + K(H)\}\}$$
$$H_{\rm bayes} := \mathrm{minarg}_{H'} \{K(H') : H' := \mathrm{maxarg}_{H \in \mathcal{H}_D} \{\Pr(H|D)\}\} \tag{16}$$

we obtain (15) from (10). □



As a consequence, if $\alpha(P,H)$ is small enough and Bayes's rule selects an admissible hypothesis,
and so does ideal MDL, then both criteria are (approximately) optimized by both selected
hypotheses.

In order to identify application of MDL with application of Bayes's rule on some prior distribution
$P$ as in Theorem 4 we must assume that, given $D$, the Fundamental Inequality is satisfied
for $H_{\rm mdl}$ as defined in (16). This means that $H_{\rm mdl}$ is $P$-random for the used prior distribution $P$.
One choice to guarantee this is

$$P(\cdot) := \mathbf{m}(\cdot) \; (= 2^{-K(\cdot)}).$$

This is a valid choice even though $\mathbf{m}$ is not recursive. Namely, we only require that $\mathbf{m}(\cdot)/P(\cdot)$
be enumerable (Definition 10, Appendix), which is certainly guaranteed by the choice of $P(\cdot) :=
\mathbf{m}(\cdot)$. The choice of $\mathbf{m}(\cdot)$ as prior is an objective and recursively invariant Occam's razor: simple
hypotheses $H$ (with $K(H) \ll l(H)$) have high $\mathbf{m}$-probability, and complex or random hypotheses
$H$ (with $K(H) \approx l(H)$) have low $\mathbf{m}$-probability $2^{-l(H)}$. The randomness test $\log(\mathbf{m}(H)/P(H))$
evaluates to $0$ for every $H$, which means that all hypotheses are random with respect to distribution
$\mathbf{m}(\cdot)$.



Theorem 5 Let $\alpha(P,H)$ in the FI (10) be small (for example $\alpha = 0$) and prior $P(\cdot) := \mathbf{m}(\cdot)$. Then
the Fundamental Inequality (10) is satisfied iff data sample $D$ is (almost) $\Pr(\cdot|H_{\rm mdl})$-random.

Proof. With $\alpha(P,H) = 0$ and $P(\cdot) := \mathbf{m}(\cdot)$ (so $-\log P(H) = K(H)$ by the Coding Theorem) we can
rewrite (10) as

$$-\log \Pr(D|H) = K(D|H).$$

The theorem follows by Theorem 11 in the Appendix. □
If there is a true probabilistic model $H$ then with high probability the data sample $D$ will be
$\Pr(\cdot|H)$-random. This suggests that in selecting the best model we should contemplate only those
models $H$ for which this is the case. The requirement that $D$ be $\Pr(\cdot|H_{\rm mdl})$-random constrains the
domain of hypotheses from which we can choose $H_{\rm mdl}$.

Corollary 1 With high probability ideal MDL is an application of Bayes's rule with the universal
prior distribution $\mathbf{m}(\cdot)$ and selection of an optimal admissible hypothesis $H_{\rm mdl}$ such that the data
sample $D$ is $\Pr(\cdot|H_{\rm mdl})$-random (in the sense of the Appendix).

Since the notion of individual randomness incorporates all effectively testable properties of
randomness (in the finite case only to some degree), application of ideal MDL will select the
simplest hypothesis $H$ that balances $K(D|H)$ and $K(H)$ such that the data sample $D$ is random
to it, as far as can effectively be ascertained.

Restricted to the class of admissible hypotheses, ideal MDL does not simply select the hypoth- 
esis that precisely fits the data but it selects an hypothesis that would typically generate the data. 
With some amount of overstatement one can say that if one obtains perfect data for a true hy- 
pothesis, then ideal MDL interprets these data as data obtained from a simpler hypothesis subject 
to measuring errors. Consequently, in this case ideal MDL is going to give you the false simple 
hypothesis and not the complex true hypothesis. 

• Ideal MDL only gives us the true hypothesis if the data satisfy certain conditions relative to
the true hypothesis. Stated differently: there are only data and no true hypothesis for ideal 
MDL. The principle simply obtains the hypothesis suggested by the data and it assumes that 
the data are random with respect to the hypothesis. 



2.6 Applications 



Unfortunately, the function $K$ is not computable [17]. For practical applications one must settle
for easily computable approximations, for example, restricted model classes and particular codings 
of hypotheses. In this paper we will not address the question which encoding one uses in practice, 



but refer to references [p2l p^, p3^, 3C ] . 



In statistical applications, $H$ is some statistical distribution (or model) $H = P(\theta)$ with a list
of parameters $\theta = (\theta_1, \ldots, \theta_k)$, where the number $k$ may vary and influence the (descriptional)
complexity of $\theta$. (For example, $H$ can be a normal distribution $N(\mu, \sigma)$ described by $\theta = (\mu, \sigma)$.)
Each parameter $\theta_i$ is truncated to fixed finite precision. The data sample consists of $n$ outcomes
$y = (y_1, \ldots, y_n)$ of $n$ trials $x = (x_1, \ldots, x_n)$ for distribution $P(\theta)$. The data $D$ in the above formulas
is given as $D = (x, y)$. By expansion of conditional probabilities we have therefore

$$\Pr(D|H) = \Pr(x, y|H) = \Pr(x|H) \cdot \Pr(y|H, x).$$

In the argument above we take the negative logarithm of $\Pr(D|H)$, that is,

$$-\log \Pr(D|H) = -\log \Pr(x|H) - \log \Pr(y|H, x).$$

Taking the negative logarithm in Bayes's rule and the analysis of the previous section now yields
that MDL selects the hypothesis with highest inferred probability satisfying: $x$ is $\Pr(\cdot|H)$-random
and $y$ is $\Pr(\cdot|H, x)$-random. Bayesian reasoning selects the same hypothesis provided the hypothesis
with maximal inferred probability has $x, y$ satisfy the same conditions.
Example 4 (Learning Polynomials) We wish to fit a polynomial $f$ of unknown degree to a set
of data points $D$ such that it can predict future data $y$ given $x$. Even if the data did come from a
polynomial curve of degree, say, two, because of measurement errors and noise we still cannot find
a polynomial of degree two fitting all $n$ points exactly. In general, the higher the degree of the fitting
polynomial, the greater the precision of the fit. For $n$ data points, a polynomial of degree $n - 1$
can be made to fit exactly, but probably has no predictive value. Applying ideal MDL we look for
$H_{\rm mdl} := \mathrm{minarg}_H \{K(x, y|H) + K(H)\}$.

Let us apply the ideal MDL principle where we describe all $(k-1)$-degree polynomials by a
vector of $k$ entries, each entry with a precision of $d$ bits. Then, the entire polynomial is described
by

$$kd + O(\log kd) \text{ bits}. \tag{17}$$

(We have to describe $k$, $d$, and account for self-delimiting encoding of the separate items.) For
example, $ax^2 + bx + c$ is described by $(a, b, c)$ and can be encoded by about $3d$ bits. Each datapoint
$(x_i, y_i)$ which needs to be encoded separately with precision of $d$ bits per coordinate costs about $2d$
bits.

For simplicity assume that probability $\Pr(x|H) = 1$ (because $x$ is prescribed). To apply the
ideal MDL principle we must trade the cost of hypothesis $H$ (17) against the cost of describing $y$
given $H$ and $x$. As a trivial example, suppose $n - 1$ out of $n$ datapoints fit a polynomial of degree
2 exactly, but only 2 points lie on any polynomial of degree 1 (a straight line). Of course, there is
a polynomial of degree $n - 1$ which fits the data precisely (up to precision). Then the ideal MDL
cost is $3d + 2d$ for the 2nd degree polynomial, $2d + (n-2)d$ for the 1st degree polynomial, and $nd$
for the $(n-1)$th degree polynomial. Given the choice among those three options, we select the 2nd
degree polynomial for all $n > 5$. O
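The arithmetic of this trade-off is easy to check mechanically. The sketch below encodes the three costs exactly as stated in Example 4, ignoring the lower-order $O(\log kd)$ term of (17), and picks the cheapest candidate. The function names and the dictionary labels are hypothetical.

```python
def mdl_costs(n, d):
    """Two-part code lengths in bits for the three candidates of Example 4.

    n is the number of data points and d the precision in bits. The
    O(log kd) overhead of (17) is ignored, as in the example itself.
    """
    return {
        "degree 2": 3 * d + 2 * d,        # 3 coefficients plus one exceptional point
        "degree 1": 2 * d + (n - 2) * d,  # 2 coefficients plus (n - 2) points off the line
        "degree n-1": n * d,              # n coefficients, no exceptions
    }

def select(n, d):
    """Return the candidate with the smallest total description length."""
    costs = mdl_costs(n, d)
    return min(costs, key=costs.get)

for n in (4, 5, 6, 100):
    print(n, mdl_costs(n, d=16), select(n, d=16))
```

All three costs coincide at $n = 5$, and for every larger $n$ the quadratic is strictly cheapest, which is the threshold $n > 5$ stated in the example. The choice of $d$ does not affect the selection because it multiplies every cost.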



Remark 4 (Exception-Based MDL) A hypothesis $H$ minimizing $K(D|H) + K(H)$ always satisfies

$$K(D|H) + K(H) \geq K(D).$$

Let $E \subseteq D$ denote the subset of the data that are exceptions to $H$ in the sense of not being classified
correctly by $H$. The following exception-based MDL (E-MDL) is sometimes confused with MDL:
With $E := D - D_H$, where $D_H$ is the data set classified according to $H$, select

$$H_{\rm e\text{-}mdl} = \mathrm{minarg}_{H'} \{K(H') : H' := \mathrm{minarg}_H \{K(H) + K(E|H)\}\}.$$

In E-MDL we look for the shortest description of an accepting program for the data consisting of
a classification rule $H$ and an exception list $E$. While this principle sometimes gives good results,
application may lead to absurdity as the following shows:

In many problems the data sample consists of positive examples only. For example, in learning
(a grammar for) the English language, given the Oxford English Dictionary. According to E-MDL the
best hypothesis is the trivial grammar $H$ generating all sentences over the alphabet. Namely, this
grammar gives $K(H) = 0$ independent of $D$ and also $E := \emptyset$. Consequently,

$$\min\{K(H) + K(E|H)\} = K(H) = 0,$$

which is absurd. The E-MDL principle is vindicated and reduces to standard MDL in the context
of interpreting $D = (x, y)$ with $x$ fixed as in "supervised learning." Now for constant $K(x|H)$

$$H_{\rm e\text{-}mdl} = \mathrm{minarg}_{H'} \{K(H') : H' := \mathrm{minarg}_H \{K(H) + K(y|H, x) + K(x|H)\}\}$$

is the same as

$$H_{\rm mdl} = \mathrm{minarg}_{H'} \{K(H') : H' := \mathrm{minarg}_H \{K(H) + K(y|H, x)\}\}.$$

Ignoring the constant $x$ in the conditional $K(y|H, x)$ corresponds to $K(E|H)$. O

3 Prediction by Minimum Description Length 

Let us consider theory formation in science as the process of obtaining a compact description of 
past observations together with predictions of future ones. Ray Solomonoff |2^, ^ argues that the 
preliminary data of the investigator, the hypotheses he proposes, the experimental setup he designs,
the trials he performs, the outcomes he obtains, the new hypotheses he formulates, and so on, can 
all be encoded as the initial segment of a potentially infinite binary sequence. The investigator 
obtains increasingly longer initial segments of an infinite binary sequence $\omega$ by performing more
and more experiments on some aspect of nature. To describe the underlying regularity of $\omega$,
the investigator tries to formulate a theory that governs $\omega$ on the basis of the outcome of past
experiments. Candidate theories (hypotheses) are identified with computer programs that compute 
binary sequences starting with the observed initial segment. 

There are many different possible infinite sequences (histories) on which the investigator can 
embark. The phenomenon he wants to understand or the strategy he uses can be stochastic. Each 
such sequence corresponds to one never-ending sequential history of conjectures and refutations and 
confirmations and each initial segment has different continuations governed by certain probabilities. 
In this view each phenomenon can be identified with a measure $\mu$ on the continuous sample space
of infinite sequences over a basic description alphabet. This distribution $\mu$ can be said to be the
concept or phenomenon involved. Now the aim is to predict outcomes concerning a phenomenon $\mu$
under investigation. In this case we have some prior evidence (prior distribution over the hypotheses,
experimental data) and we want to predict future events.

This situation can be modelled by considering a sample space $S$ of one-way infinite sequences of
basic elements $B$ defined by $S = B^\infty$. We assume a prior distribution $\mu$ over $S$ with $\mu(x)$ denoting
the probability of a sequence starting with $x$. Here $\mu(\cdot)$ is a semimeasure satisfying

$$\mu(\epsilon) \leq 1,$$
$$\mu(x) \geq \sum_{a \in B} \mu(xa).$$

Given a previously observed data string $x$, the inference problem is to predict the next symbol in
the output sequence, that is, to extrapolate the sequence $x$. In terms of the variables in (1), $H_{xy}$ is
the hypothesis that the sequence starts with initial segment $xy$. Data $D_x$ consists of the fact that
the sequence starts with initial segment $x$. Then, $\Pr(D_x|H_{xy}) = 1$, that is, the data is forced by
the hypothesis, or $\Pr(D_z|H_{xy}) = 0$ for $z$ not a prefix of $xy$, that is, the hypothesis contradicts the
data. For $P(H_{xy})$ and $\Pr(D_x)$ in (1) we substitute $\mu(xy)$ and $\mu(x)$, respectively. For $\Pr(H_{xy}|D_x)$
we substitute $\mu(y|x)$. This way (1) is rewritten as

$$\mu(y|x) = \frac{\mu(xy)}{\mu(x)}. \tag{18}$$

Traditional notation is "$\mu(\Gamma_x)$" instead of "$\mu(x)$" where cylinder $\Gamma_x = \{\omega \in S : \omega \text{ starts with } x\}$. We use "$\mu(x)$"
for convenience. $\mu$ is a measure if equalities hold.
The final probability $\mu(y|x)$ is the probability of the next symbol string being $y$, given the initial
string $x$. Obviously we now only need the prior probability $\mu$ to evaluate $\mu(y|x)$. The goal of
inductive inference in general is to be able to either (i) predict, or extrapolate, the next element
after $x$ or (ii) to infer an underlying effective process that generated $x$, and hence to be able to
predict the next symbol. In the most general deterministic case such an effective process is a
Turing machine, but it can also be a probabilistic Turing machine or, say, a Markov process. The
central task of inductive inference is to find a universally valid approximation to $\mu$ which is good
at estimating the conditional probability that a given segment $x$ will be followed by a segment $y$.
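When $\mu$ is known and computable, prediction is just the ratio $\mu(y|x) = \mu(xy)/\mu(x)$. The sketch below illustrates this for one simple recursive measure, a two-state Markov chain over $\{0,1\}$. The chain, its parameters `start` and `stay`, and the function names are assumptions chosen for illustration and do not come from the text. Exact rational arithmetic is used so that the measure property $\mu(x) = \mu(x0) + \mu(x1)$ can be checked without rounding.

```python
from fractions import Fraction

def markov_measure(x, start=Fraction(1, 2), stay=Fraction(9, 10)):
    """mu(x): probability that a sequence of the chain starts with bit string x.

    The first bit is 0 or 1 with probability `start` each (1/2 by default);
    every later bit repeats its predecessor with probability `stay`.
    """
    if not x:
        return Fraction(1)  # mu(empty string) = 1
    prob = start
    for prev, cur in zip(x, x[1:]):
        prob *= stay if cur == prev else 1 - stay
    return prob

def predict(mu, x, y):
    """Conditional probability mu(y|x) = mu(xy) / mu(x)."""
    return mu(x + y) / mu(x)

print(predict(markov_measure, "0001", "1"))   # chance the next bit repeats the last one
print(predict(markov_measure, "0001", "11"))  # chance the next two bits are both 1
```

The hard case treated in the sequel is the one where $\mu$ is unknown, so that this ratio cannot be evaluated directly and a universal substitute for $\mu$ is needed.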

In general this is impossible. But suppose we restrict the class of priors $\mu$ to the recursive
semimeasures and restrict the set of basic elements to $\{0,1\}$. Under this relatively mild restriction
on the admissible semimeasures $\mu$, it turns out that we can use the universal semimeasure $\mathbf{M}$ as a
"universal prior" (replacing the real prior $\mu$) for prediction. The theory of the universal semimeasure
$\mathbf{M}$, the analogue in the sample space $\{0,1\}^\infty$ of $\mathbf{m}$ in the sample space $\{0,1\}^*$ equivalent to
$\mathcal{N}$, is developed in Chapter 4, and Chapter 5. It is defined with respect to a special type of
Turing machine called a monotone Turing machine. The universal semimeasure $\mathbf{M}$ multiplicatively
dominates all enumerable (Definition 10, Appendix) semimeasures. It can be shown that if we
flip a fair coin to generate the successive bits on the input tape of the universal reference monotone
Turing machine, then the probability that it outputs $x\alpha$ ($x$ followed by something) is $\mathbf{M}(x)$ [34].



The universal probability $\mathbf{M}(\cdot)$ allows us to explicitly express a universal randomness test for
the elements in $\{0,1\}^\infty$ analogous to the universal randomness tests for the finite elements of $\{0,1\}^*$
developed in the Appendix. This notion of randomness with respect to a recursive semimeasure $\mu$
satisfies the following explicit characterization of a universal (sequential) randomness test (for proof
see 0], Chapter 4):

Lemma 1 Let $\mu$ be a recursive semimeasure on $\{0,1\}^\infty$. An infinite binary sequence $\omega$ is $\mu$-random iff

$$\sup_n \mathbf{M}(\omega_1 \ldots \omega_n)/\mu(\omega_1 \ldots \omega_n) < \infty,$$

and the set of $\mu$-random sequences has $\mu$-measure one.

In contrast with the discrete case, the elements of {0, 1}°° can be sharply divided in the random 
ones that pass all effective (sequential) randomness tests and the nonrandom ones that do not. 
We start by demonstrating convergence of $\mathbf{M}(y|x)$ and $\mu(y|x)$ for $x \to \infty$, with $\mu$-probability 1.

A semimeasure $\mu$ is recursive if there is a Turing machine that for every $x$ and $b$ computes $\mu(x)$ within precision $2^{-b}$.

We can express the "goodness" of predictions according to $\mathbf{M}$ with respect to a true $\mu$ as follows: Let $S_n$ be
the $\mu$-expected value of the square of the difference in $\mu$-probability and $\mathbf{M}$-probability of 0 occurring at the $n$th
prediction

$$S_n = \sum_{l(x) = n-1} \mu(x) (\mathbf{M}(0|x) - \mu(0|x))^2.$$

We may call $S_n$ the expected squared error at the $n$th prediction. The following celebrated result of Solomonoff, [p6|,
says that $\mathbf{M}$ is very suitable for prediction (a proof using Kullback-Leibler divergence is given in [0):

Theorem 6 Let $\mu$ be a recursive semimeasure. Using the notation above, $\sum_n S_n \leq k/2$ with $k = K(\mu) \ln 2$. (Hence,
$S_n$ converges to 0 faster than $1/n$.)

However, Solomonoff's result is not strong enough to give the required convergence of conditional probabilities with
$\mu$-probability 1.
Theorem 7 Let $\mu$ be a positive recursive measure. If the length of $y$ is fixed and the length of $x$
grows to infinity, then

$$\frac{\mathbf{M}(y|x)}{\mu(y|x)} \to 1$$

with $\mu$-probability one. The infinite sequences $\omega$ with prefixes $x$ satisfying the displayed asymptotics
are precisely the $\mu$-random sequences.

Proof. We use an approach based on the Submartingale Convergence Theorem, pp.
324-325, which states that the following property holds for each sequence of random variables
$\omega_1, \omega_2, \ldots$. If $f(\omega_{1:n})$ is a $\mu$-submartingale, and the $\mu$-expectation $\mathbf{E}|f(\omega_{1:n})| < \infty$, then it follows
that $\lim_{n \to \infty} f(\omega_{1:n})$ exists with $\mu$-probability one.

In our case,

$$t(\omega_{1:n}|\mu) = \frac{\mathbf{M}(\omega_{1:n})}{\mu(\omega_{1:n})}$$

is a $\mu$-submartingale, and the $\mu$-expectation $\mathbf{E}\,t(\omega_{1:n}|\mu) \leq 1$. Therefore, there is a set $A \subseteq \{0,1\}^\infty$
with $\mu(A) = 1$, such that for each $\omega \in A$ the limit $\lim_{n \to \infty} t(\omega_{1:n}|\mu) < \infty$. These are the $\mu$-random
$\omega$'s by Corollary 4.5.5. Consequently, for fixed $m$, for each $\omega$ in $A$, we have

$$\lim_{n \to \infty} \frac{\mathbf{M}(\omega_{1:n+m})/\mu(\omega_{1:n+m})}{\mathbf{M}(\omega_{1:n})/\mu(\omega_{1:n})} = 1,$$

provided the limit of the denominator is not zero. The latter fact is guaranteed by the universality
of $\mathbf{M}$: for every $x \in \{0,1\}^*$ we have $\mathbf{M}(x)/\mu(x) \geq 2^{-K(\mu)}$ by Theorem 4.5.1 and Equation 4.11 in
0. □
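The universal semimeasure $\mathbf{M}$ is not computable, so the convergence in Theorem 7 cannot be run directly. As a loose illustration of the same mechanism, the sketch below replaces $\mathbf{M}$ by a Bayes mixture $\xi = \sum_i w_i \mu_i$ over a small finite class of Bernoulli measures. This substitution is an assumption of the example and not the construction used in the text. Like $\mathbf{M}$, the mixture dominates each of its components, $\xi(x) \geq w_i \mu_i(x)$, and its one-step prediction approaches that of the true component along a typical sequence. The class `THETAS`, the weights, and the function names are hypothetical.

```python
import math
import random

THETAS = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical model class: Pr(bit = 1)
WEIGHTS = [1 / len(THETAS)] * len(THETAS)   # hypothetical prior weights w_i

def log_bernoulli(theta, ones, zeros):
    """Natural log of mu_theta(x) for a string x with the given bit counts."""
    return ones * math.log(theta) + zeros * math.log(1 - theta)

def mixture_next_one(ones, zeros):
    """xi(1|x) = xi(x1)/xi(x) for the mixture, computed via posterior weights."""
    logs = [math.log(w) + log_bernoulli(t, ones, zeros)
            for w, t in zip(WEIGHTS, THETAS)]
    top = max(logs)                          # subtract the max to avoid underflow
    post = [math.exp(v - top) for v in logs]
    total = sum(post)
    return sum(p / total * t for p, t in zip(post, THETAS))

def ratio_after(n, true_theta=0.7, seed=1):
    """xi(1|x)/mu(1|x) after n bits sampled from the true Bernoulli measure mu."""
    rng = random.Random(seed)
    ones = sum(rng.random() < true_theta for _ in range(n))
    return mixture_next_one(ones, n - ones) / true_theta

for n in (0, 10, 100, 5000):
    print(n, ratio_after(n))
```

With no data the ratio reflects only the prior weights. After a few thousand bits the posterior weight concentrates on the true component and the ratio is close to 1, mirroring the statement of Theorem 7 for this toy class.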

Example 5 Suppose we are given an infinite decimal sequence $\omega$. The even positions contain
the subsequent digits of $\pi = 3.1415\ldots$, and the odd positions contain uniformly distributed,
independently drawn random decimal digits. Then, $\mathbf{M}(a|\omega_{1:2i}) \to 1/10$ for $a = 0, 1, \ldots, 9$, while
$\mathbf{M}(a|\omega_{1:2i+1}) \to 1$ if $a$ is the $i$th digit of $\pi$, and to 0 otherwise. O

The universal distribution combines a weighted version of the predictions of all enumerable
semimeasures, including the prediction of the semimeasure with the shortest program. It is not a priori
clear that the shortest program dominates in all cases, and as we shall see it does not. However,
we show that in the overwhelming majority of cases (the typical cases) the shortest program
dominates sufficiently to use shortest programs for prediction.

Taking the negative logarithm on both sides of (18), we want to determine $y$ with $l(y) = n$ that
minimizes

$$-\log \mu(y|x) = -\log \mu(xy) + \log \mu(x).$$

This $y$ is the most probable extrapolation of $x$.

Definition 7 Let $U$ be the reference monotone machine. The complexity $Km$, called monotone
complexity, is defined as

$$ Km(x) = \min\{ l(p) : U(p) = x\omega,\ \omega \in \{0,1\}^\infty \}. $$

We omit the Invariance Theorem for $Km$ complexity, which is stated and proved in complete analogy with
the corresponding theorems for the $C$ and $K$ varieties.
Theorem 8 Let $\mu$ be a recursive semimeasure, and let $\omega$ be a $\mu$-random infinite binary sequence
and $xy$ be a finite prefix of $\omega$. As $l(x)$ grows unboundedly with $l(y)$ fixed,

$$ \lim_{l(x) \to \infty} -\log \mu(y \mid x) = Km(xy) - Km(x) < \infty, $$

where $Km(xy)$ and $Km(x)$ grow unboundedly.

Proof. By definition, $-\log \mathbf{M}(x) \le Km(x)$ since the left-hand side of the inequality weighs
the probability of all programs that produce $x$ while the right-hand side weighs the probability of
the shortest program only. In the discrete case we have the Coding Theorem (Lemma 2): $K(x) = -\log \mathbf{m}(x) \pm O(1)$.
L.A. Levin [15] erroneously conjectured that also $Km(x) = -\log \mathbf{M}(x)$. But P. Gács [10] showed
that they are different, although the differences must in some sense be very small:

Claim 3

$$ -\log \mathbf{M}(x) \le Km(x) \le -\log \mathbf{M}(x) + Km(l(x)); \qquad (19) $$

$$ \sup_{x \in \{0,1\}^*} |KM(x) - Km(x)| = \infty. $$

However, for a priori almost all infinite sequences $x$, the difference between $Km(\cdot)$ and $-\log \mathbf{M}(\cdot)$
is bounded by a constant [10]:

Claim 4 (i) For random strings $x \in \{0,1\}^*$ we have $Km(x) + \log \mathbf{M}(x) = O(1)$.

(ii) There exists a function $f(n)$ which goes to infinity with $n \to \infty$ such that $Km(x) +
\log \mathbf{M}(x) \ge f(l(x))$, for infinitely many $x$. If $x$ is a finite binary string, then we can choose
$f(n)$ as the inverse of some version of Ackermann's function.

Let $\omega$ be a $\mu$-random infinite binary sequence and $xy$ be a finite prefix of $\omega$. As $l(x)$ grows
unboundedly with $l(y)$ fixed, we have by the preceding convergence theorem

$$ \lim_{l(x) \to \infty} \log \mu(y \mid x) - \log \mathbf{M}(y \mid x) = 0. \qquad (20) $$

Therefore, if $x$ and $y$ satisfy the above conditions, then maximizing $\mu(y \mid x)$ over $y$ means minimizing
$-\log \mathbf{M}(y \mid x)$. It is shown in Claim 3 that $-\log \mathbf{M}(x)$ is slightly smaller than $Km(x)$, the length
of the shortest program for $x$ on the reference universal monotonic machine. For binary programs
this difference is very small (Claim 3), but it can be unbounded in the length of $x$.

Together this shows the following. Given $xy$ that is a prefix of a (possibly not $\mu$-random) $\omega$,
optimal prediction of a fixed-length extrapolation $y$ from an unboundedly growing prefix $x$ of $\omega$ need
not necessarily be achieved by the shortest programs for $xy$ and $x$ minimizing $Km(xy) - Km(x)$,
but is achieved by considering the weighted version of all programs for $xy$ and $x$, which is represented
by

$$ -\log \mathbf{M}(xy) + \log \mathbf{M}(x) = (Km(xy) - g(xy)) - (Km(x) - g(x)). $$

Here $g(x)$ is a function which can rise to in between the inverse of the Ackermann function and
$Km(l(x)) \le \log \log x$, but only in case $x$ is not $\mu$-random.

Therefore, for certain $x$ and $y$ which are not $\mu$-random, optimization using the minimum length
programs may result in incorrect predictions. For $\mu$-random $x$ we have that $-\log \mathbf{M}(x)$ and $Km(x)$
coincide up to an additive constant independent of $x$, that is, $g(xy) = g(x) = O(1)$ (Claim 4). Hence,
together with Equation 20, the theorem is proven. □



By its definition $Km$ is monotone in the sense that always $Km(xy) - Km(x) \ge 0$. The closer
this difference is to zero, the better the shortest effective monotone program for $x$ is also a shortest
effective monotone program for $xy$ and hence predicts $y$ given $x$. Therefore, for all large enough
$\mu$-random $x$, predicting by determining $y$ which minimizes the difference of the minimum program
lengths of $xy$ and $x$ gives a good prediction. Here $y$ should be preferably large enough to eliminate
the influence of the $O(1)$ term.

Corollary 2 (Prediction by Data Compression) Assume the conditions of Theorem 8. With
$\mu$-probability going to one as $l(x)$ grows unboundedly, a fixed-length $y$ extrapolation from $x$
maximizes $\mu(y \mid x)$ iff $y$ can be maximally compressed with respect to $x$ in the sense that it minimizes
$Km(xy) - Km(x)$. That is, $y$ is the string that minimizes the length difference between the shortest
program that outputs $xy\ldots$ and the shortest program that outputs $x\ldots$.
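
The corollary can be illustrated with a computable stand-in. The sketch below is only an analogy: $Km$ is not computable, so it assumes that the length of a zlib encoding may serve as a crude upper-bound proxy for it. The helper names `compressed_bits` and `predict_by_compression` are hypothetical and not part of the paper.

```python
import zlib
from typing import Sequence


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib encoding of data.

    Assumption: this is only a rough, computable proxy for Km(data).
    """
    return 8 * len(zlib.compress(data, 9))


def predict_by_compression(x: bytes, candidates: Sequence[bytes]) -> bytes:
    """Return the candidate y that minimizes proxy(xy) - proxy(x)."""
    base = compressed_bits(x)
    return min(candidates, key=lambda y: compressed_bits(x + y) - base)


if __name__ == "__main__":
    history = b"ab" * 200                      # a highly regular prefix x
    options = [b"zqxv", b"abab"]               # candidate extrapolations y
    print(predict_by_compression(history, options))
```

With a regular prefix the continuation that extends the regularity costs the fewest extra bits, which mirrors the statement that the best extrapolation minimizes $Km(xy) - Km(x)$.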

4 Conclusion 

The analysis of both hypothesis identification by ideal MDL and prediction shows that maximally 
compressed descriptions give good results on the data samples which are random with respect to 
probabilistic hypotheses. These data samples form the overwhelming majority and occur with 
probability going to one when the length of the data sample grows unboundedly. 
A Appendix: Kolmogorov Complexity

The Kolmogorov complexity of a finite object $x$ is the length of the shortest effective binary
description of $x$. We give some definitions to establish notation. For more details see the cited literature. Let
$x, y, z \in \mathcal{N}$, where $\mathcal{N}$ denotes the natural numbers and we identify $\mathcal{N}$ and $\{0,1\}^*$ according to the
correspondence

$$ (0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots $$

Here $\epsilon$ denotes the empty word with no letters. The length $l(x)$ of $x$ is the number of bits in the
binary string $x$. For example, $l(010) = 3$ and $l(\epsilon) = 0$.
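
The correspondence between natural numbers and binary strings can be made concrete. The following is a minimal sketch; the function names `nat_to_str` and `str_to_nat` are hypothetical, and the rule used (write n + 1 in binary and drop the leading 1) is one standard way to realize the listed pairs.

```python
def nat_to_str(n: int) -> str:
    """Return the n-th binary string in the order '', 0, 1, 00, 01, ..."""
    if n < 0:
        raise ValueError("n must be a natural number")
    return bin(n + 1)[3:]          # binary of n + 1 without '0b' and the leading '1'


def str_to_nat(s: str) -> int:
    """Inverse of nat_to_str: prepend a '1', read as binary, subtract one."""
    return int("1" + s, 2) - 1


if __name__ == "__main__":
    for n in range(5):
        print(n, repr(nat_to_str(n)))
```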

The emphasis is on binary sequences only for convenience; observations in any alphabet can be 
so encoded in a way that is 'theory neutral'. 

A binary string $x$ is a proper prefix of a binary string $y$ if we can write $y = xz$ for $z \ne \epsilon$. A set
$\{x, y, \ldots\} \subseteq \{0,1\}^*$ is prefix-free if for any pair of distinct elements in the set neither is a proper
prefix of the other. A prefix-free set is also called a prefix code. Each binary string $x = x_1 x_2 \ldots x_n$
has a special type of prefix code, called a self-delimiting code,

$$ \bar{x} = x_1 x_1 x_2 x_2 \ldots x_n \neg x_n, $$

where $\neg x_n = 0$ if $x_n = 1$ and $\neg x_n = 1$ otherwise. This code is self-delimiting because we can
determine where the code word $\bar{x}$ ends by reading it from left to right without backing up. Using
this code we define the standard self-delimiting code for $x$ to be $x' = \overline{l(x)}\,x$. It is easy to check that
$l(\bar{x}) = 2n$ and $l(x') = n + 2 \log n$.
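
The two codes can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the length $l(x)$ is written in ordinary binary notation (so that the empty string has length code "0"), which is a convenience of this sketch rather than the paper's identification of numbers with strings.

```python
def bar(x: str) -> str:
    """Doubled code: every bit of x is repeated, except that the last bit
    is followed by its complement, which marks the end of the code word."""
    if not x:
        raise ValueError("the doubled code is defined for nonempty strings only")
    flipped = "0" if x[-1] == "1" else "1"
    return "".join(b + b for b in x[:-1]) + x[-1] + flipped


def read_bar(stream: str):
    """Decode one doubled code word from the front of stream; return (x, rest)."""
    bits = []
    for i in range(0, len(stream) - 1, 2):
        first, second = stream[i], stream[i + 1]
        bits.append(first)
        if first != second:                     # unequal pair: last bit reached
            return "".join(bits), stream[i + 2:]
    raise ValueError("no complete code word found")


def standard(x: str) -> str:
    """x' = bar(l(x)) x, with l(x) in ordinary binary (assumption of this sketch)."""
    return bar(bin(len(x))[2:]) + x


def read_standard(stream: str):
    """Decode one standard code word from the front of stream; return (x, rest)."""
    length_bits, rest = read_bar(stream)
    n = int(length_bits, 2)
    if len(rest) < n:
        raise ValueError("stream ends inside a code word")
    return rest[:n], rest[n:]
```

Because each code word announces its own end, several code words can be concatenated and still be split unambiguously by reading from left to right.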

Let $T_1, T_2, \ldots$ be a standard enumeration of all Turing machines, and let $\phi_1, \phi_2, \ldots$ be the
enumeration of corresponding functions which are computed by the respective Turing machines.
That is, $T_i$ computes $\phi_i$. These functions are the partial recursive functions or computable functions.
The Kolmogorov complexity $C(x)$ of $x$ is the length of the shortest binary program from which $x$
is computed. Formally, we define this as follows.

Definition 8 The Kolmogorov complexity of $x$ given $y$ (for free on a special input tape) is

$$ C(x \mid y) = \min_{p,i} \{ l(i'p) : \phi_i(p, y) = x,\ p \in \{0,1\}^*,\ i \in \mathcal{N} \}. $$

Define $C(x) = C(x \mid \epsilon)$.

Though defined in terms of a particular machine model, the Kolmogorov complexity is machine- 
independent up to an additive constant and acquires an asymptotically universal and absolute 
character through Church's thesis, from the ability of universal machines to simulate one another 
and execute any effective process. The Kolmogorov complexity of an object can be viewed as an 
absolute and objective quantification of the amount of information in it. This leads to a theory of 
absolute information contents of individual objects in contrast to classic information theory which
deals with average information to communicate objects produced by a random source [17].

For technical reasons we also need a variant of complexity, so-called prefix complexity, which is
associated with Turing machines for which the set of programs resulting in a halting computation
is prefix free. We can realize this by equipping the Turing machine with a one-way input tape, a
separate work tape, and a one-way output tape. Such Turing machines are called prefix machines
since the halting programs for any one of them form a prefix free set. Taking the universal prefix
machine $U$ we can define the prefix complexity analogously with the plain Kolmogorov complexity.
If $x^*$ is the first shortest program for $x$ then the set $\{x^* : U(x^*) = x, x \in \{0,1\}^*\}$ is a prefix code.
That is, each $x^*$ is a code word for some $x$, and if $x^*$ and $y^*$ are code words for $x$ and $y$ with $x \ne y$
then $x^*$ is not a prefix of $y^*$.

Let $\langle \cdot \rangle$ be a standard invertible effective one-one encoding from $\mathcal{N} \times \mathcal{N}$ to a prefix-free recursive
subset of $\mathcal{N}$. For example, we can set $\langle x, y \rangle = x'y'$. We insist on prefix-freeness and recursiveness
because we want a universal Turing machine to be able to read an image under $\langle \cdot \rangle$ from left to
right and determine where it ends.

Definition 9 The prefix Kolmogorov complexity of $x$ given $y$ (for free) is

$$ K(x \mid y) = \min_{p,i} \{ l(\langle p, i \rangle) : \phi_i(\langle p, y \rangle) = x,\ p \in \{0,1\}^*,\ i \in \mathcal{N} \}. $$

Define $K(x) = K(x \mid \epsilon)$.
The nice thing about $K(x)$ is that we can interpret $2^{-K(x)}$ as a probability distribution. Namely,
$K(x)$ is the length of a shortest prefix-free program for $x$. By the fundamental Kraft's inequality,
see for example the cited literature, we know that if $l_1, l_2, \ldots$ are the code-word lengths of a prefix code, then
$\sum_x 2^{-l_x} \le 1$. This leads to the notion of universal distribution, a rigorous form of Occam's razor,
below.
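
Kraft's inequality is easy to check numerically on small codes. The sketch below uses hypothetical helper names and exact rational arithmetic; it only verifies the inequality for given finite codes and is not a proof.

```python
from fractions import Fraction


def is_prefix_free(code_words) -> bool:
    """True if no code word is a proper prefix of a different code word."""
    words = list(code_words)
    return not any(a != b and b.startswith(a) for a in words for b in words)


def kraft_sum(code_words) -> Fraction:
    """Sum of 2^(-length) over all code words, computed exactly."""
    return sum((Fraction(1, 2 ** len(w)) for w in code_words), Fraction(0))


if __name__ == "__main__":
    code = ["0", "10", "110", "111"]
    print(is_prefix_free(code), kraft_sum(code))
```

A set of words whose Kraft sum exceeds one, such as {0, 1, 00}, cannot be prefix-free, which is the contrapositive of the inequality.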



B Appendix: Universal Distribution 

A Turing machine $T$ computes a function on the natural numbers. However, we can also consider
the computation of real valued functions. For this purpose we consider both the argument of $\phi$
and the value of $\phi$ as a pair of natural numbers according to the standard pairing function $\langle \cdot \rangle$. We
define a function from $\mathcal{N}$ to the reals $\mathcal{R}$ by a Turing machine $T$ computing a function $\phi$ as follows.
Interpret the computation $\phi(\langle x, t \rangle) = \langle p, q \rangle$ to mean that the quotient $p/q$ is the rational valued
$t$th approximation of $f(x)$.

Definition 10 A function $f : \mathcal{N} \to \mathcal{R}$ is enumerable if there is a Turing machine $T$ computing a
total function $\phi$ such that $\phi(x, t+1) \ge \phi(x, t)$ and $\lim_{t \to \infty} \phi(x, t) = f(x)$. This means that $f$ can
be computably approximated from below. If $f$ can also be computably approximated from above
then we call $f$ recursive.

A function $P : \mathcal{N} \to [0, 1]$ is a probability distribution if $\sum_{x \in \mathcal{N}} P(x) \le 1$. (The inequality is a
technical convenience. We can consider the surplus probability to be concentrated on the undefined
element $u \notin \mathcal{N}$.)

Consider the family $\mathcal{EP}$ of enumerable probability distributions on the sample space $\mathcal{N}$ (equivalently,
$\{0,1\}^*$). It is known [17] that $\mathcal{EP}$ contains an element $\mathbf{m}$ that multiplicatively dominates
all elements of $\mathcal{EP}$. That is, for each $P \in \mathcal{EP}$ there is a constant $c$ such that $c\, \mathbf{m}(x) \ge P(x)$ for all
$x \in \mathcal{N}$. We call $\mathbf{m}$ a universal distribution.

The family $\mathcal{EP}$ contains all distributions with computable parameters which have a name, or
in which we could conceivably be interested, or which have ever been considered. The dominating
property means that, up to a multiplicative constant, $\mathbf{m}$ assigns at least as much probability to each
object as any other distribution in the family $\mathcal{EP}$ does. In this sense it is a universal a priori
distribution accounting for maximal ignorance.
It turns out that if the true a priori distribution in Bayes's rule is recursive, then using the single
distribution $\mathbf{m}$, or its continuous analogue the measure $\mathbf{M}$ on the sample space $\{0,1\}^\infty$ (Section 3),
is provably as good as using the true a priori distribution.

We also know [16] that

Lemma 2

$$ -\log \mathbf{m}(x) = K(x) \pm O(1). \qquad (21) $$

That means that $\mathbf{m}$ assigns high probability to simple objects and low probability to complex or
random objects. For example, for $x = 00 \ldots 0$ ($n$ 0's) we have $K(x) = K(n) \le \log n + 2 \log \log n$
since the program

print n_times a ''0''

prints $x$. (The additional $2 \log \log n$ term is the penalty term for a self-delimiting encoding.) Then,
$1/(n \log^2 n) = O(\mathbf{m}(x))$. But if we flip a coin to obtain a string $y$ of $n$ bits, then with overwhelming
probability $K(y) \ge n$ (because $y$ does not contain effective regularities which allow compression),
and hence $\mathbf{m}(y) = O(1/2^n)$.
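
To get a feeling for the gap between the two cases, the bound $\log n + 2 \log \log n$ can be evaluated numerically and compared with $n$. The sketch below is plain arithmetic; the function name is hypothetical and additive constants are ignored.

```python
from math import log2


def zeros_complexity_bound(n: int) -> float:
    """Evaluate log n + 2 log log n (base 2) for n >= 2, ignoring O(1) terms."""
    if n < 2:
        raise ValueError("the bound is evaluated for n >= 2 only")
    return log2(n) + 2 * log2(log2(n))


if __name__ == "__main__":
    for exponent in (10, 16, 20):
        n = 2 ** exponent
        # A string of n zeros needs roughly this many bits; a typical
        # coin-flip string of the same length needs about n bits.
        print(n, round(zeros_complexity_bound(n), 2))
```

For a string of about a million zeros the bound stays below thirty bits, whereas a typical random string of that length needs about a million bits.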
C Appendix: Randomness Tests



One can consider those objects as nonrandom in which one can find sufficiently many regularities. 
In other words, we would like to identify "incompressibility" with "randomness." This is proper if 
the sequences that are incompressible can be shown to possess the various properties of randomness 
(stochasticity) known from the theory of probability. That this is possible is the substance of the 
celebrated theory developed by the Swedish mathematician Per Martin-Löf. This theory was
further elaborated in |34, p4|, 111] and later papers. 



There are many properties known which probability theory attributes to random objects. To 
give an example, consider sequences of n tosses with a fair coin. Each sequence of n zeros and ones 
is equiprobable as an outcome: its probability is $2^{-n}$. If such a sequence is to be random in the
sense of a proposed new definition, then the number of ones in x should be near to n/2, the number 
of occurrences of blocks "00" should be close to n/4, and so on. 

It is not difficult to show that each such single property separately holds for all incompressible 
binary strings. But we want to demonstrate that incompressibility implies all conceivable effectively 
testable properties of randomness (both the known ones and the as yet unknown ones). This way, 
the various theorems in probability theory about random sequences carry over automatically to 
incompressible sequences. 

In the case of finite strings we cannot hope to distinguish sharply between random and nonrandom
strings. For instance, considering the set of binary strings of a fixed length, it would not
be natural to fix an $m$ and call a string with $m$ zeros random and a string with $m + 1$ zeros
nonrandom.

Let us borrow some ideas from statistics. We are given a certain sample space S with an associated 
distribution P. Given an element x of the sample space, we want to test the hypothesis "x is a 
typical outcome." Practically speaking, the property of being typical is the property of belonging 
to any reasonable majority. In choosing an object at random, we have confidence that this object 
will fall precisely in the intersection of all such majorities. The latter condition we identify with x 
being random. 

To ascertain whether a given element of the sample space belongs to a particular reasonable
majority we introduce the notion of a test. Generally, a test is given by a prescription which, for
every level of significance $\epsilon$, tells us for what elements $x$ of $S$ the hypothesis "$x$ belongs to majority
$M$ in $S$" should be rejected, where $\epsilon = 1 - P(M)$. Taking $\epsilon = 2^{-m}$, $m = 1, 2, \ldots$, this amounts to
saying that we have a description of the set $V \subseteq \mathcal{N} \times S$ of nested critical regions

$$ V_m = \{x : (m, x) \in V\}, $$

$$ V_m \supseteq V_{m+1}, \quad m = 1, 2, \ldots $$

The condition that $V_m$ be a critical region on the significance level $\epsilon = 2^{-m}$ amounts to requiring,
for all $n$,

$$ \sum_x \{P(x) : l(x) = n,\ x \in V_m\} \le \epsilon. $$

The complement of a critical region $V_m$ is called the $(1 - \epsilon)$ confidence interval. If $x \in V_m$, then
the hypothesis "$x$ belongs to majority $M$," and therefore the stronger hypothesis "$x$ is random," is
rejected with significance level $\epsilon$. We can say that $x$ fails the test at the level of critical region $V_m$.
Example 6 A string $x_1 x_2 \ldots x_n$ with many initial zeros is not very random. We can test this aspect
as follows. The special test $V$ has critical regions $V_1, V_2, \ldots$. Consider $x = 0.x_1 x_2 \ldots x_n$ as a rational
number, and each critical region as a half-open interval $V_m = [0, 2^{-m})$ in $[0, 1)$, $m = 1, 2, \ldots$. Then
the subsequent critical regions test the hypothesis "$x$ is random" by considering the subsequent
digits in the binary expansion of $x$. We reject the hypothesis on the significance level $\epsilon = 2^{-m}$
provided $x_1 = x_2 = \cdots = x_m = 0$. ◇

Example 7 Another test for randomness of finite binary strings rejects when the relative frequency
of ones differs too much from 1/2. This particular test can be implemented by rejecting the
hypothesis of randomness of $x = x_1 x_2 \ldots x_n$ at level $\epsilon = 2^{-m}$ provided $|2 f_n - n| > g(n, m)$, where
$f_n = \sum_{i=1}^{n} x_i$, and $g(n, m)$ is the least number determined by the requirement that the number of
binary strings $x$ of length $n$ for which this inequality holds is at most $2^{n-m}$. ◇
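
The frequency test of Example 7 can be computed directly for small $n$. The following is a minimal sketch with hypothetical function names; it finds $g(n, m)$ by counting strings with binomial coefficients instead of enumerating them.

```python
from math import comb


def threshold(n: int, m: int) -> int:
    """Least g such that at most 2^(n-m) strings of length n
    satisfy |2*f_n - n| > g, where f_n is the number of ones."""
    limit = 2 ** (n - m)
    for g in range(n + 1):
        rejected = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) > g)
        if rejected <= limit:
            return g
    raise AssertionError("unreachable: g = n rejects no string")


def rejects(x: str, m: int) -> bool:
    """True if the frequency test rejects randomness of x at level 2^(-m)."""
    n = len(x)
    ones = x.count("1")
    return abs(2 * ones - n) > threshold(n, m)


if __name__ == "__main__":
    print(threshold(10, 1), rejects("0000000000", 1), rejects("0101010101", 1))
```

The threshold grows with $m$: a stricter significance level rejects only strings whose count of ones deviates more strongly from $n/2$.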

In practice, statistical tests are effective prescriptions such that we can compute, at each level 
of significance, for what strings the associated hypothesis should be rejected. It would be hard 
to imagine what use it would be in statistics to have tests that are not effective in the sense of 
computability theory. 

Definition 11 Let $P$ be a recursive probability distribution on the sample space $\mathcal{N}$. A total
function $\delta : \mathcal{N} \to \mathcal{N}$ is a P-test (Martin-Löf test for randomness) if:

1. $\delta$ is enumerable (the set $V = \{(m, x) : \delta(x) \ge m\}$ is recursively enumerable); and

2. $\sum \{P(x) : \delta(x) \ge m,\ l(x) = n\} \le 2^{-m}$, for all $n$.

The critical regions associated with the common statistical tests are present in the form of
the sequence $V_1 \supseteq V_2 \supseteq \cdots$, where $V_m = \{x : \delta(x) \ge m\}$, for $m \ge 1$. Nesting is assured since
$\delta(x) \ge m + 1$ implies $\delta(x) \ge m$. Each set $V_m$ is recursively enumerable because of Item 1.

A particularly important case is when $P$ is the uniform distribution, defined by $L(x) = 2^{-2l(x)}$. The
restriction of $L$ to strings of length $n$ is defined by $L_n(x) = 2^{-n}$ for $l(x) = n$ and 0 otherwise. (By
definition, $L_n(x) = L(x \mid l(x) = n)$.) Then, Item 2 can be rewritten as $\sum_{x \in V_m} L_n(x) \le 2^{-m}$, which
is the same as

$$ d(\{x : l(x) = n,\ x \in V_m\}) \le 2^{n-m}. $$

In this case we often speak simply of a test, with the uniform distribution $L$ understood.

In statistical tests membership of $(m, x)$ in $V$ can usually be determined in polynomial time in
$l(m) + l(x)$.

Example 8 The previous test examples can be rephrased in terms of Martin-Löf tests. Let us try a
more subtle example. A real number such that all bits in odd positions in its binary representation
are 1's is not random with respect to the uniform distribution. To show this we need a test which
detects sequences of the form $x = 1 x_2 1 x_4 1 x_6 1 x_8 \ldots$. Define a test $\delta$ by

$$ \delta(x) = \max\{i : x_1 = x_3 = \cdots = x_{2i-1} = 1\}, $$

and $\delta(x) = 0$ if $x_1 = 0$. For example: $\delta(01111) = 0$; $\delta(10011) = 1$; $\delta(11011) = 1$; $\delta(10100) = 2$;
$\delta(11111) = 3$. To show that $\delta$ is a test we have to show that $\delta$ satisfies the definition of a test.
Clearly, $\delta$ is enumerable (even recursive). If $\delta(x) \ge m$ where $l(x) = n \ge 2m$, then there are $2^{m-1}$
possibilities for the $(2m - 1)$-length prefix of $x$, and $2^{n-(2m-1)}$ possibilities for the remainder of $x$.
Therefore, $d\{x : \delta(x) \ge m,\ l(x) = n\} \le 2^{n-m}$. ◇
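
The test of Example 8 and its critical-region bound can be checked by brute force for short strings. This is a small sketch; the function name `delta` mirrors the symbol in the text, and the exhaustive check over all strings of length 6 is only an illustration.

```python
from itertools import product


def delta(x: str) -> int:
    """Number of leading odd positions 1, 3, 5, ... of x that all hold '1'."""
    count = 0
    for bit in x[0::2]:            # x[0], x[2], ... are positions 1, 3, ... (1-based)
        if bit != "1":
            break
        count += 1
    return count


def critical_region_size(n: int, m: int) -> int:
    """Number of strings x of length n with delta(x) >= m."""
    return sum(1 for bits in product("01", repeat=n) if delta("".join(bits)) >= m)


if __name__ == "__main__":
    for m in (1, 2, 3):
        print(m, critical_region_size(6, m), 2 ** (6 - m))
```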



Definition 12 A universal Martin-Löf test for randomness with respect to distribution $P$, a universal
P-test for short, is a test $\delta_0(\cdot \mid P)$ such that for each P-test $\delta$, there is a constant $c$, such that
for all $x$, we have $\delta_0(x \mid P) \ge \delta(x) - c$.

We say that $\delta_0(\cdot \mid P)$ (additively) majorizes $\delta$. Intuitively, $\delta_0(\cdot \mid P)$ constitutes a test for randomness
which incorporates all particular tests $\delta$ in a single test. No test for randomness $\delta$
other than $\delta_0(\cdot \mid P)$ can discover more than a constant amount more deficiency of randomness
in any string $x$. In terms of critical regions, a universal test is a test such that if a binary
sequence is random with respect to that test, then it is random with respect to any conceivable
test, neglecting a change in significance level. Namely, with $\delta_0(\cdot \mid P)$ a universal P-test, let
$U = \{(m, x) : \delta_0(x \mid P) \ge m\}$, and, for any test $\delta$, let $V = \{(m, x) : \delta(x) \ge m\}$. Then, defining
the associated critical zones as before, we find

$$ V_{m+c} \subseteq U_m, \quad m = 1, 2, \ldots, $$

where $c$ is a constant (dependent only on $U$ and $V$).

It is a major result that there exists a universal P-test. The proof goes by first showing that the 
set of all tests is enumerable. 

Lemma 3 We can effectively enumerate all P-tests. 

Proof. We start with the standard enumeration $\phi_1, \phi_2, \ldots$ of partial recursive functions from $\mathcal{N}$
into $\mathcal{N} \times \mathcal{N}$, and turn this into an enumeration $\delta_1, \delta_2, \ldots$ of all and only P-tests. The list $\phi_1, \phi_2, \ldots$
enumerates all and only recursively enumerable sets of pairs of integers as $\{\phi_i(x) : x \ge 1\}$ for
$i = 1, 2, \ldots$. In particular, for any P-test $\delta$, the set $\{(m, x) : \delta(x) \ge m\}$ occurs in this list. The only
thing we have to do is to eliminate those $\phi_i$ of which the range does not correspond to a P-test.

First, we effectively modify each $\phi$ (we drop the subscript for convenience) to a function $\psi$
such that range $\phi$ equals range $\psi$, and $\psi$ has the special property that if $\psi(n)$ is defined, then
$\psi(1), \psi(2), \ldots, \psi(n - 1)$ are also defined. This can be done by dovetailing the computations of
$\phi$ on the different arguments: in the first phase do one step of the computation of $\phi(1)$, in the
second phase do the second step of the computation of $\phi(1)$ and the first step of the computation
of $\phi(2)$. In general, in the $n$th phase we execute the $n_1$th step of the computation of $\phi(n_2)$, for all
$n_1, n_2$ satisfying $n_1 + n_2 = n + 1$. We now define $\psi$ as follows. If the first computation that halts is
that of $\phi(i)$, then set $\psi(1) := \phi(i)$. If the second computation that halts is that of $\phi(j)$, then set
$\psi(2) := \phi(j)$, and so on.
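
Dovetailing can be sketched with Python generators standing in for step-by-step computations. Everything named here is hypothetical: `toy_phi` is an invented partial function (it diverges on even arguments), each `yield` models one computation step, and the loop is cut off after `max_phases` phases only so that the sketch terminates.

```python
def toy_phi(k: int):
    """Toy computation: diverges for even k, halts with value 10*k after k steps."""
    if k % 2 == 0:
        while True:
            yield                    # one step of a computation that never halts
    for _ in range(k):
        yield                        # one step of a halting computation
    return 10 * k


def dovetail(make_computation, max_phases: int):
    """Yield results in the order in which the computations halt.

    In phase n the computation on argument n is started and every
    computation that is still running advances by exactly one step.
    """
    running = {}
    for phase in range(1, max_phases + 1):
        running[phase] = make_computation(phase)
        for k in sorted(running):
            try:
                next(running[k])
            except StopIteration as halt:
                del running[k]
                yield halt.value     # next value of psi


if __name__ == "__main__":
    print(list(dovetail(toy_phi, 10)))
```

The successive outputs play the role of $\psi(1), \psi(2), \ldots$: divergent computations never block the halting ones, and the range of the original function is preserved.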

Secondly, use each $\psi$ to construct a test $\delta$ by approximation from below. In the algorithm, at
each stage of the computation the local variable array $\delta(1 : \infty)$ contains the current approximation
to the list of function values $\delta(1), \delta(2), \ldots$. This is doable because the nonzero part of the
approximation is always finite.

Step 1 Initialize $\delta$ by setting $\delta(x) := 0$ for all $x$. {If the range of $\psi$ is empty, then this assignment
will not be changed in the remainder of the procedure. That is, $\delta$ stays identically zero and
it is trivially a test.} Initialize $i := 0$.

Step 2 Set $i := i + 1$; compute $\psi(i)$ and let its value be $(x, m)$.

Step 3 If $\delta(x) \ge m$ then go to Step 2 else set $\delta(x) := m$.

Step 4 If $\sum \{P(y) : \delta(y) \ge k,\ l(y) = l(x)\} > 2^{-k}$ for some $k$, $k = 1, \ldots, m$ {since $P$ is a recursive
function we can effectively test whether the new value of $\delta(x)$ violates Definition 11} then set
$\delta(x) := 0$ and terminate {the computation of $\delta$ is finished} else go to Step 2.

(With $P$ the uniform distribution, for $i = 1$ the conditional in Step 4 simplifies to $m > l(x)$.)
In case the range of $\psi$ is already a test, then the algorithm never finishes but forever approximates
$\delta$ from below. If $\psi$ diverges for some argument then the computation goes on forever and does not
change $\delta$ any more. The resulting $\delta$ is an enumerable test. If the range of $\psi$ is not a test, then
at some point the conditional in Step 4 is violated and the approximation of $\delta$ terminates. The
resulting $\delta$ is a test, even a recursive one. Executing this procedure on all functions in the list
$\phi_1, \phi_2, \ldots$, we obtain an effective enumeration $\delta_1, \delta_2, \ldots$ of all P-tests (and only P-tests). We are
now in the position to define a universal P-test. □
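
Steps 1 to 4 of the construction can be sketched for a finite list of pairs. This is an illustration under assumptions: the function names are hypothetical, the enumeration $\psi$ is replaced by a finite Python list of pairs $(x, m)$, and $P$ is taken to be the uniform distribution $L_n$ on strings of each length $n$.

```python
from fractions import Fraction


def uniform(y: str) -> Fraction:
    """L_n(y) = 2^(-n) for a binary string y of length n."""
    return Fraction(1, 2 ** len(y))


def build_test(pairs, p=uniform):
    """Approximate a test from below, following Steps 1-4.

    Returns (delta, ok): delta maps strings to values (absent keys mean 0),
    and ok is False if the procedure terminated because Step 4 found a
    violation of the critical-region condition.
    """
    delta = {}                                        # Step 1: delta(x) = 0 everywhere
    for x, m in pairs:                                # Step 2: next pair (x, m)
        if delta.get(x, 0) >= m:                      # Step 3: nothing new
            continue
        delta[x] = m
        for k in range(1, m + 1):                     # Step 4: check every level k <= m
            mass = sum(p(y) for y, value in delta.items()
                       if len(y) == len(x) and value >= k)
            if mass > Fraction(1, 2 ** k):
                delta[x] = 0                          # undo the offending value
                return delta, False                   # and terminate
    return delta, True


if __name__ == "__main__":
    print(build_test([("0000", 2), ("1111", 2)]))
    print(build_test([("00", 3)]))
```

The second call illustrates the parenthetical remark in the text: for a single pair under the uniform distribution, the condition in Step 4 is violated exactly when $m > l(x)$.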

Theorem 9 Let $\delta_1, \delta_2, \ldots$ be an enumeration of the above P-tests. Then, $\delta_0(x \mid P) = \max\{\delta_y(x) - y :
y \ge 1\}$ is a universal P-test.

Proof. Note first that $\delta_0(\cdot \mid P)$ is a total function on $\mathcal{N}$ because of Item 2 in Definition 11.

(1) The enumeration $\delta_1, \delta_2, \ldots$ in Lemma 3 yields an enumeration of recursively enumerable
sets:

$$ \{(m, x) : \delta_1(x) \ge m\},\ \{(m, x) : \delta_2(x) \ge m\},\ \ldots $$

Therefore, $V = \{(m, x) : \delta_0(x \mid P) \ge m\}$ is recursively enumerable.

(2) Let us verify that the critical regions are small enough: for each $n$,

$$ \sum_{l(x)=n} \{P(x) : \delta_0(x \mid P) \ge m\} \le \sum_{y=1}^{\infty} \sum_{l(x)=n} \{P(x) : \delta_y(x) \ge m + y\} \le \sum_{y=1}^{\infty} 2^{-m-y} = 2^{-m}. $$

(3) By its definition, $\delta_0(\cdot \mid P)$ majorizes each $\delta$ additively. Hence, it is universal. □

By definition of $\delta_0(\cdot \mid P)$ as a universal P-test, any particular P-test $\delta$ can discover at most a
constant amount more regularity in a sequence $x$ than does $\delta_0(\cdot \mid P)$, in the sense that for each $\delta_y$
we have $\delta_y(x) \le \delta_0(x \mid P) + y$ for all $x$.

For any two universal P-tests $\delta_0(\cdot \mid P)$ and $\delta'_0(\cdot \mid P)$, there is a constant $c \ge 0$, such that for all
$x$, we have $|\delta_0(x \mid P) - \delta'_0(x \mid P)| \le c$.

We started out with the objective to establish in what sense incompressible strings may be 
called random. 
Theorem 10 The function $f(x) = l(x) - C(x \mid l(x)) - 1$ is a universal L-test with $L$ the uniform
distribution.

Proof. (1) We first show that $f(x)$ is a test with respect to the uniform distribution. The
set $\{(m, x) : f(x) \ge m\}$ is recursively enumerable since $C()$ can be approximated from above by a
recursive process.

(2) We verify the condition on the critical regions. Since the number of $x$'s with $C(x \mid l(x)) \le
l(x) - m - 1$ cannot exceed the number of programs of length at most $l(x) - m - 1$, we have
$d(\{x : f(x) \ge m\}) \le 2^{l(x)-m} - 1$.

(3) We show that for each test $\delta$, there is a constant $c$, such that $f(x) \ge \delta(x) - c$. The main
idea is to bound $C(x \mid l(x))$ by exhibiting a description of $x$, given $l(x)$. Fix $x$. Let the set $A$ be
defined as

$$ A = \{z : \delta(z) \ge \delta(x),\ l(z) = l(x)\}. $$

We have defined $A$ such that $x \in A$ and $d(A) \le 2^{l(x)-\delta(x)}$. Let $\delta = \delta_y$ in the standard enumeration
$\delta_1, \delta_2, \ldots$ of tests. Given $y$, $l(x)$, and $\delta(x)$, we can enumerate all elements of $A$. Together with $x$'s
index $j$ in enumeration order of $A$, this suffices to find $x$. We pad the standard binary representation
of $j$ with nonsignificant zeros to a string $s = 00 \ldots 0j$ of length $l(x) - \delta(x)$. This is possible since
$l(s) \ge l(d(A))$. The purpose of changing $j$ to $s$ is that now the number $\delta(x)$ can be deduced from
$l(s)$ and $l(x)$. In particular, there is a Turing machine which computes $x$ from input $\bar{y}s$, when
$l(x)$ is given for free. Consequently, since $C()$ is the shortest effective description, $C(x \mid l(x)) \le
l(x) - \delta(x) + 2l(y) + 1$. Since $y$ is a constant depending only on $\delta$, we can set $c = 2l(y) + 2$. □

In Theorem 9 we have exhibited a universal P-test for randomness of a string $x$ of length $n$
with respect to an arbitrary recursive distribution $P$ over the sample set $S = B^n$ with $B = \{0, 1\}$.

The universal P-test measures how justified is the assumption that $x$ is the outcome of an
experiment with distribution $P$. We now use $\mathbf{m}$ to investigate alternative characterizations of
random elements of the sample set $S = B^*$ (equivalently, $S = \mathcal{N}$).

Definition 13 Let $P$ be a recursive probability distribution on $\mathcal{N}$. A sum P-test is a nonnegative
enumerable function $\delta$ satisfying

$$ \sum_x P(x) 2^{\delta(x)} \le 1. \qquad (22) $$

A universal sum P-test is a test that additively dominates each sum P-test.

The sum tests of Definition 13 are slightly stronger than the tests according to Martin-Löf's
original Definition 11.



Lemma 4 Each sum P-test is a P-test. If $\delta(x)$ is a P-test, then there is a constant $c$ such that
$\delta'(x) = \delta(x) - 2 \log \delta(x) - c$ is a sum P-test.

Proof. It follows immediately from the new definition that for all $n$

$$ \sum \{P(x) : \delta(x) \ge k,\ l(x) = n\} \le 2^{-k}. \qquad (23) $$

Namely, if (23) is false for some $n$ and $k$, then we contradict (22) by

$$ \sum_{x \in \mathcal{N}} P(x) 2^{\delta(x)} \ge \sum_{l(x)=n,\ \delta(x) \ge k} P(x) 2^{k} > 1. $$

Conversely, if $\delta(x)$ satisfies (23) for all $n$, then for some constant $c$, the function $\delta(x) - 2 \log \delta(x) -
c$ satisfies (22). □

This shows that the sum test is not much stronger than the original test. One advantage of
(22) is that it is just one inequality, instead of infinitely many, one for each $n$. We give an exact
expression for a universal sum P-test in terms of complexity.

Theorem 11 Let $P$ be a recursive probability distribution. The function

$$ \kappa_0(x \mid P) = \log(\mathbf{m}(x)/P(x)) $$

is a universal sum P-test.

Proof. Since $\mathbf{m}$ is enumerable, and $P$ is recursive, $\kappa_0(x \mid P)$ is enumerable. We first show that
$\kappa_0(x \mid P)$ is a sum P-test:

$$ \sum_x P(x) 2^{\kappa_0(x \mid P)} = \sum_x \mathbf{m}(x) \le 1. $$

It is only left to show that $\kappa_0(x \mid P)$ additively dominates all sum P-tests. For each sum P-test $\delta$,
the function $P(x) 2^{\delta(x)}$ is a semimeasure that is enumerable. It has been shown, Appendix B, that
there is a positive constant $c$ such that $c \cdot \mathbf{m}(x) \ge P(x) 2^{\delta(x)}$. Taking logarithms, there is another
constant $c$ such that $c + \kappa_0(x \mid P) \ge \delta(x)$, for all $x$. □

Example 9 An important case is as follows. If we consider a distribution $P$ restricted to a domain
$A \subset \mathcal{N}$, then the universal sum P-test becomes $\log(\mathbf{m}(x \mid A)/P(x \mid A))$. For example, if $L_n$ is the
uniform distribution on $A = \{0, 1\}^n$, then the universal sum $L_n$-test for $x \in A$ becomes

$$ \kappa_0(x \mid L_n) = \log(\mathbf{m}(x \mid A)/L_n(x)) = n - K(x \mid n). $$

Namely, $L_n(x) = 1/2^n$ and $\log \mathbf{m}(x \mid A) = -K(x \mid A)$ by the Coding Theorem (Lemma 2), where we
can describe $A$ by giving $n$. ◇



Example 10 The Noiseless Coding Theorem states that the Shannon-Fano code, which codes a
source word $x$ straightforwardly as a word of about $-\log P(x)$ bits, nearly achieves the
optimal expected code word length. This code is uniform in the sense that it does not use any
characteristics of $x$ itself to associate a code word with a source word $x$. The code that codes
each source word $x$ as a code word of length $K(x)$ also achieves the optimal expected code word
length. This code is nonuniform in that it uses characteristics of individual $x$'s to obtain shorter
code words. Any difference in code word length between these two encodings for a particular object
$x$ can only be due to exploitation of the individual regularities in $x$.

Define the randomness deficiency of a finite object $x$ with respect to $P$ as

$$ -\lfloor \log P(x) \rfloor - K(x) = -\log P(x) + \log \mathbf{m}(x) + O(1) = \kappa_0(x \mid P) + O(1), $$

by the Coding Theorem (Lemma 2). That is, the randomness deficiency is the outcome of the
universal sum P-test of Theorem 11. ◇
Example 11 Let us compare the randomness deficiency as measured by $\kappa_0(x \mid P)$ with that
measured by the universal test $\delta_0(x)$, for the uniform distribution. That test consisted actually of
tests for a whole family $L_n$ of distributions, where $L_n$ is the uniform distribution such that each
$L_n(x) = 2^{-n}$ for $l(x) = n$, and zero otherwise. Rewrite $\delta_0(x)$ as

$$ \delta_0(x \mid L_n) = n - C(x \mid n), $$

for $l(x) = n$, and $\infty$ otherwise. This is close to the expression for $\kappa_0(x \mid L_n)$ obtained in Example 9.
From the relations between $C$ and $K$ in [17] it follows that

$$ |\delta_0(x \mid L_n) - \kappa_0(x \mid L_n)| \le 2 \log C(x). $$

The formulation of the universal sum test in Theorem 11 can be interpreted as follows. An
element $x$ is random with respect to distribution $P$, that is, $\kappa_0(x \mid P) = O(1)$, if $P(x)$ is large enough,
not in absolute value but relative to $\mathbf{m}(x)$. If we did not have this relativization, then we would
not be able to distinguish between random and nonrandom outcomes for the uniform distribution
$L_n(x)$ above.

Let us look at an example. Let $x = 00 \ldots 0$ of length $n$. Then, $\kappa_0(x \mid L_n) = n - K(x \mid n) = n + O(1)$.
If we flip a coin $n$ times to generate $y$, then with overwhelming probability $K(y \mid n) \ge n$ and
$\kappa_0(y \mid L_n) = O(1)$. ◇
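
The contrast between the two strings can be imitated with an ordinary compressor. The sketch below is only a rough illustration: it assumes that the zlib output length is a usable upper-bound proxy for $K(x \mid n)$, so the number of bits saved gives an estimate from below of the deficiency $n - K(x \mid n)$. The function name is hypothetical.

```python
import random
import zlib


def deficiency_estimate(x: bytes) -> int:
    """Bits saved by zlib on x, clipped at zero.

    Assumption: since a compressor only upper-bounds complexity, this
    underestimates the true randomness deficiency up to additive terms.
    """
    return max(0, 8 * len(x) - 8 * len(zlib.compress(x, 9)))


if __name__ == "__main__":
    regular = bytes(1000)                                    # 1000 zero bytes
    rng = random.Random(0)
    typical = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random bytes
    print(deficiency_estimate(regular), deficiency_estimate(typical))
```

The all-zero string shows a deficiency close to its full length, while the pseudo-random string shows essentially none, in line with the two cases discussed in the example.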



Example 12 According to modern physics, electrons, neutrons and protons satisfy the Fermi-Dirac
distribution. We distribute $n$ particles among $k$ cells, for $n \le k$, such that each cell is occupied by
at most one particle; and all distinguished arrangements satisfying this have the same probability.

We can treat each arrangement as a binary string: an empty cell is a zero and a cell with a
particle is a one. Since there are $\binom{k}{n}$ possible arrangements, the probability for each arrangement
$x$ to happen, under the Fermi-Dirac distribution, is $FD_{n,k}(x) = 1/\binom{k}{n}$. According to Theorem 11,

$$ \kappa_0(x \mid FD_{n,k}) = \log(\mathbf{m}(x \mid k, n)/FD_{n,k}(x)) = -K(x \mid n, k) + \log \binom{k}{n} $$

is a universal sum test with respect to the Fermi-Dirac distribution. It is easy to see that a binary
string $x$ of length $k$ with $n$ ones has complexity $K(x \mid n, k) \le \log \binom{k}{n}$, and $K(x \mid n, k) \ge \log \binom{k}{n}$ for
most such $x$. Hence, a string $x$ with maximal $K(x \mid n, k)$ will pass this universal sum test. Each
individual such string possesses all effectively testable properties of typical strings under the Fermi-Dirac
distribution. Hence, in the limit for $n$ and $k$ growing unboundedly, we cannot effectively
distinguish one such string from other such strings. ◇
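
The upper bound on $K(x \mid n, k)$ rests on the fact that, given $n$ and $k$, an arrangement is determined by its index among all $\binom{k}{n}$ arrangements. A minimal sketch of such an index (lexicographic ranking) follows; the function names are hypothetical and additive constants in the complexity bound are ignored.

```python
from itertools import combinations
from math import ceil, comb, log2


def rank(x: str) -> int:
    """Lexicographic index of x among all binary strings with the same
    length and the same number of ones."""
    k, ones, index = len(x), x.count("1"), 0
    for i, bit in enumerate(x):
        if bit == "1":
            # Strings that agree so far but hold '0' here come earlier.
            index += comb(k - i - 1, ones)
            ones -= 1
    return index


def index_bits(k: int, n: int) -> int:
    """Bits that suffice to write the index of an arrangement of n particles in k cells."""
    return ceil(log2(comb(k, n)))


def arrangements(k: int, n: int):
    """All binary strings of length k with exactly n ones."""
    for cells in combinations(range(k), n):
        yield "".join("1" if i in cells else "0" for i in range(k))


if __name__ == "__main__":
    print(sorted(rank(s) for s in arrangements(5, 2)), index_bits(5, 2))
```

Since the ranks of all arrangements are distinct and smaller than $\binom{k}{n}$, about $\log \binom{k}{n}$ bits identify any arrangement once $n$ and $k$ are known.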



Example 13 Markov's Inequality says the following. Let $P$ be any probability distribution, $f$
any nonnegative function with $P$-expected value $E = \sum_x P(x) f(x) < \infty$. For $E > 0$ we have

$$\sum \{P(x) : f(x)/E > k\} < 1/k.$$

Let $P$ be any probability distribution (not necessarily recursive). The $P$-expected value of
$\mathbf{m}(x)/P(x)$ is

$$\sum_x P(x) \frac{\mathbf{m}(x)}{P(x)} \le 1.$$

Then, by Markov's Inequality,

$$\sum \{P(x) : \mathbf{m}(x) \le k P(x)\} \ge 1 - \frac{1}{k}. \tag{24}$$
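Inequality (24) uses only the fact that the total mass of $\mathbf{m}$ is at most 1, so it can be checked on a toy case with any semimeasure in its place. The sketch below assumes a four-element sample space and a hypothetical semimeasure `Q` standing in for $\mathbf{m}$, which itself is not computable; `mass_where_close` is a name introduced here for illustration.

```python
from fractions import Fraction


def mass_where_close(P, Q, k):
    """Sum of P(x) over the outcomes x with Q(x) <= k * P(x)."""
    return sum(p for x, p in P.items() if Q.get(x, 0) <= k * p)


# Toy probability distribution P and toy semimeasure Q (total mass 7/8 <= 1).
P = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
Q = {"a": Fraction(1, 16), "b": Fraction(1, 16), "c": Fraction(1, 2), "d": Fraction(1, 4)}

for k in (2, 3, 4, 8):
    mass = mass_where_close(P, Q, k)
    # The P-mass of outcomes where Q is at most k times P is at least 1 - 1/k.
    assert mass >= 1 - Fraction(1, k)
    print(k, mass)
```

Exact rational arithmetic is used so that the comparison with $1 - 1/k$ is not affected by rounding.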



Since $\mathbf{m}$ dominates all enumerable semimeasures multiplicatively, we have for all $x$

$$P(x) \le c_P \mathbf{m}(x), \text{ and it can be shown } c_P = 2^{K(P)}. \tag{25}$$



Equations (24) and (25) have the following consequences.



1. If $x$ is a random sample from a simple recursive distribution $P$, where "simple" means that
$K(P)$ is small, then $\mathbf{m}$ is a good estimate for $P$. For instance, if $x$ is randomly drawn from
distribution $P$, then the probability that

$$c_P^{-1} \mathbf{m}(x) \le P(x) \le c_P \mathbf{m}(x)$$

is at least $1 - 1/c_P$.

2. If we know or believe that $x$ is random with respect to $P$, and we know $P(x)$, then we can
use $P(x)$ as an estimate of $\mathbf{m}(x)$.
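The probability bound in the first consequence can be traced back to the two numbered inequalities. The following is a brief sketch of that step, assuming the constant $c_P$ of (25) and the choice $k = c_P$ in (24):

```latex
\begin{align*}
\sum \{P(x) : \mathbf{m}(x) \le c_P P(x)\} &\ge 1 - \frac{1}{c_P}
  && \text{by (24) with } k = c_P, \\
P(x) &\le c_P \, \mathbf{m}(x)
  && \text{for all } x, \text{ by (25)}.
\end{align*}
```

The first line says that $c_P^{-1} \mathbf{m}(x) \le P(x)$ holds with $P$-probability at least $1 - 1/c_P$, and the second holds for every $x$, so both bounds hold together with at least that probability.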

In both cases the degree of approximation depends on the index of $P$, and the randomness of
$x$ with respect to $P$, as measured by the randomness deficiency $\kappa_0(x|P) = \log(\mathbf{m}(x)/P(x))$. For
example, the uniform discrete distribution on $B^*$ can be defined by $L(x) = 2^{-2l(x)-1}$. Then, for each
$n$ we have $L_n(x) = L(x \mid l(x) = n)$. To describe $L$ takes $O(1)$ bits, and therefore

$$\kappa_0(x|L) = l(x) - K(x).$$

The randomness deficiency $\kappa_0(x|L) = O(1)$ iff $K(x) \ge l(x) - O(1)$, that is, iff $x$ is random.

The nonrecursive "distribution" $\mathbf{m}(x) = 2^{-K(x)}$ has the remarkable property that the test
$\kappa_0(x|\mathbf{m}) = 0$ for all $x$: the test shows all outcomes $x$ random with respect to it. We can interpret
(24) and (25) as saying that if the real distribution is $P$, then $P(x)$ and $\mathbf{m}(x)$ are close to each other
with large $P$-probability. Therefore, if $x$ comes from some unknown recursive distribution $P$, then
we can use $\mathbf{m}(x)$ as an estimate for $P(x)$. In other words, $\mathbf{m}(x)$ can be viewed as the universal a
priori probability of $x$.

The universal sum $P$-test $\kappa_0(x|P)$ can be interpreted in the framework of hypothesis testing
as the likelihood ratio between hypothesis $P$ and the fixed alternative hypothesis $\mathbf{m}$. In ordinary
statistical hypothesis testing, some properties of an unknown distribution $P$ are taken for granted,
and the role of the universal test can probably be reduced to some tests that are used in statistical
practice. ◇

References 

[1] E. Asmis, Epicurus' Scientific Method, Cornell University Press, 1984.

[2] T. Bayes, An essay towards solving a problem in the doctrine of chances, Philos. Trans. Roy.
Soc., 53:376-398 and 54:298-31, 1764.

[3] A.R. Barron and T.M. Cover, Minimum complexity density estimation, IEEE Trans. Inform. 
Theory, IT-37(1991), 1034-1054. 

[4] R. Carnap, Logical Foundations of Probability, Univ. Chicago Press, 1950.

[5] T.M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991. 

[6] G.J. Chaitin, A theory of program size formally identical to information theory, J. Assoc. 
Comput. Mach., 22(1975), 329-340. 

[7] A. P. Dawid, Prequential analysis, stochastic complexity, and Bayesian inference. In: Bayesian 
Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith (Eds.), 1992. 

[8] J.L. Doob, Stochastic Processes, Wiley, 1953.

[9] P. Gacs, On the symmetry of algorithmic information, Soviet Math. Dokl., 15 (1974) 1477-1480.
Correction: ibid., 15 (1974) 1480.

[10] P. Gacs, On the relation between descriptional complexity and algorithmic probability, Theoret. 
Comput. Sci., 22(1983), 71-93. 

[11] D. Hume, Treatise of Human Nature, Book I, 1739. 

[12] A.N. Kolmogorov, Three approaches to the quantitative definition of information. Problems 
Inform. Transmission 1:1 (1965) 1-7. 

[13] A.N. Kolmogorov, On logical foundations of probability theory, pp. 1-5 in Probability Theory
and Mathematical Statistics, Lect. Notes Math., Vol. 1021, K. Ito and Yu.V. Prokhorov, Eds.,
Springer-Verlag, Heidelberg, 1983.

[14] A.N. Kolmogorov and V.A. Uspensky, Algorithms and Randomness, SIAM J. Theory Probab.
Appl., 32(1987), 389-412. Without annoying translation errors pp. 3-53 in: Yu.V. Prokhorov
and V.V. Sazonov, Eds., Proc. 1st World Congress of the Bernoulli Society (Tashkent 1986),
Vol. 1: Probab. Theory and Appl., VNU Science Press, Utrecht, 1987.

[15] L.A. Levin, On the notion of a random sequence, Soviet Math. Dokl., 14(1973), 1413-1416.

[16] L.A. Levin, Laws of information conservation (non-growth) and aspects of the foundation of 
probability theory. Problems Inform. Transmission, 10(1974), 206-210. 

[17] M. Li and P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and its Applications,
Springer-Verlag, New York, 1993.

[18] P. Martin-Löf, The definition of random sequences, Inform. Contr., 9(1966), 602-619.

[19] N. Merhav and M. Feder, A strong version of the redundancy-capacity theorem of universal 
coding, IEEE Trans. Inform. Theory, IT-41(1995), 714-722. 

[20] K.R. Popper, The Logic of Scientific Discovery, University of Toronto Press, 1959. 

[21] J.J. Rissanen, Modeling by the shortest data description, Automatica-J.IFAC 14 (1978) 465-471.






[22] J.J. Rissanen, Stochastic Complexity and Modelling, The Annals of Statistics, 14(1986), 1080-1100.
Also: J.J. Rissanen, Stochastic Complexity and Statistical Inquiry, World Scientific Publishers,
1989.

[23] J.J. Rissanen, Fisher information and stochastic complexity, IEEE Trans. Inform. Theory, 
IT-42:1(1996), 40-47. 

[24] C.P. Schnorr, Zufälligkeit und Wahrscheinlichkeit; Eine algorithmische Begründung der
Wahrscheinlichkeitstheorie, Lect. Notes Math., Vol. 218, Springer-Verlag, Heidelberg, 1971.
See also: C.P. Schnorr, A survey of the theory of random sequences, pp. 193-210 in: Basic
Problems in Methodology and Linguistics, R.E. Butts and J. Hintikka, Eds., Reidel, 1977.

[25] R.J. Solomonoff, A formal theory of inductive inference. Part 1 and Part 2, Inform. Contr., 
7(1964), 1-22, 224-254. 

[26] R.J. Solomonoff, Complexity-based induction systems: comparisons and convergence theorems, 
IEEE Trans. Inform. Theory IT-24 (1978) 422-432. 

[27] D. Sow and A. Elefteriadis, Complexity distortion theory. Submitted to IEEE Trans. Inform. 
Theory, 1997. 

[28] A.M. Turing, On computable numbers with an application to the Entscheidungsproblem, Proc.
London Math. Soc., Ser. 2, 42(1936), 230-265; Correction, ibid., 43(1937), 544-546.

[29] R. von Mises, Grundlagen der Wahrscheinlichkeitsrechnung, Mathemat. Zeitsch., 5(1919), 52-99.

[30] V. Vovk, Minimum description length estimators under the universal coding scheme, in: P.
Vitanyi (Ed.), Computational Learning Theory, Proc. 2nd European Conf. (EuroCOLT '95),
Lecture Notes in Artificial Intelligence, Vol. 904, Springer-Verlag, Heidelberg, 1995, pp. 237-251;
Learning about the parameter of the Bernoulli model, J. Comput. System Sci., to appear.

[31] C.S. Wallace and D.M. Boulton, An information measure for classification, Computing Journal
11 (1968) 185-195.

[32] C.S. Wallace and P.R. Freeman, Estimation and inference by compact coding, J. Royal Stat.
Soc., Series B, 49 (1987) 240-251. Discussion: ibid., 252-265.

[33] K. Yamanishi, A Randomized Approximation of the MDL for Stochastic Models with Hidden 
Variables, Proc. 9th ACM Comput. Learning Conference, ACM Press, 1996. 

[34] A.K. Zvonkin and L.A. Levin, The complexity of finite objects and the development of the 
concepts of information and randomness by means of the theory of algorithms, Russian Math. 
Surveys 25:6 (1970) 83-124. 






